CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 1


Outline of the lecture Syllabus Introduction to Pattern Recognition Review of Probability/Statistics 2

Syllabus Prerequisite Analysis of algorithms (CS 340a/b) First-year course in Calculus Introductory Statistics (Stats 222a/b or equivalent) Linear Algebra (040a/b) Grading Midterm 30% Assignments 30% Final Project 40% will review 3

Syllabus Assignments bi-weekly, theoretical or programming in Matlab or C; no extensive programming; may include extra credit work; may discuss but work individually; due at the beginning of class. Midterm open anything, roughly on November 8 4

Syllabus Final project Choose from the list of topics or design your own. May work in a group of 2, in which case it is expected to be more extensive. 5 to 8 page report; proposals due roughly November 1; final report due December 8 5

Intro to Pattern Recognition Outline What is pattern recognition? Some applications Our toy example Structure of a pattern recognition system Design stages of a pattern recognition system 6

What is Pattern Recognition? Informally Recognize patterns in data More formally Assign an object or an event to one of the several pre-specified categories (a category is usually called a class) tea cup face phone 7

Application: male or female? classes Objects (pictures) male female Perfect PR system 8

Application: photograph or not? classes Objects (pictures) photo not photo Perfect PR system 9

Application: Character Recognition objects Perfect PR system h e l l o w o r l d In this case, the classes are all possible characters: a, b, c, ..., z 10

Application: Medical diagnostics classes objects (tumors) cancer not cancer Perfect PR system 11

Application: speech understanding objects (acoustic signal) Perfect PR system phonemes re-kig-'ni-sh&n In this case, the classes are all phonemes 12

Application: Loan applications objects (people), classes: approve or deny. Features: income, debt, married, age.
John Smith: income 200,000, debt 0, married yes, age 80
Peter White: income 60,000, debt 1,000, married no, age 30
Ann Clark: income 100,000, debt 10,000, married yes, age 40
Susan Ho: income 0, debt 20,000, married no, age 25 13

Our Toy Application: fish sorting. A camera over a conveyer belt takes a fish image; the classifier determines the fish species (salmon or sea bass) and routes the fish to the appropriate sorting chamber 14

How to design a PR system? Collect data (training data) and classify by hand salmon sea bass salmon salmon sea bass sea bass Preprocess by segmenting fish from background Extract possibly discriminating features: length, lightness, width, number of fins, etc. Classifier design Choose model Train classifier on part of collected data (training data) Test classifier on the rest of collected data (test data), i.e. the data not used for training Should classify new data (new fish images) well 15

Classifier design Notice salmon tends to be shorter than sea bass Use fish length as the discriminating feature Count number of bass and salmon of each length:
Length: 2, 4, 8, 10, 12, 14
Bass:   0, 1, 3, 8, 10, 5
Salmon: 2, 5, 10, 5, 1, 0
(histogram: count vs. length for salmon and sea bass) 16

Fish length as discriminating feature Find the best length threshold L: if fish length < L classify as salmon, if fish length > L classify as sea bass. For example, at L = 5, misclassified: 1 sea bass and 16 salmon (see the table above), for an error of 17/50 = 34% 17

Fish Length as discriminating feature (histogram: count vs. length, with fish shorter than the threshold classified as salmon and longer as sea bass). After searching through all possible thresholds L, the best is L = 9, and still 20% of the fish are misclassified 18
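
The threshold search described above is easy to make concrete. Below is a minimal sketch (not from the original slides) that scans all integer thresholds over the length/count table; the bin values and the salmon-if-shorter rule come from the slides, everything else is illustrative.

```python
# Exhaustive search for the best length threshold L on the slide's count table.
lengths = [2, 4, 8, 10, 12, 14]       # length bins from the slide
bass    = [0, 1, 3, 8, 10, 5]         # sea bass counts per bin
salmon  = [2, 5, 10, 5, 1, 0]         # salmon counts per bin
total   = sum(bass) + sum(salmon)     # 50 fish

best_L, best_errors = None, total + 1
for L in range(min(lengths), max(lengths) + 2):
    # rule: length < L -> salmon, length >= L -> sea bass
    errors  = sum(b for l, b in zip(lengths, bass)   if l <  L)  # bass called salmon
    errors += sum(s for l, s in zip(lengths, salmon) if l >= L)  # salmon called sea bass
    if errors < best_errors:
        best_L, best_errors = L, errors

print(best_L, best_errors / total)    # reproduces L = 9 with 20% error (L = 5 gives 34%)
```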

Next Step Lesson learned: Length is a poor feature alone! What to do? Try another feature Salmon tends to be lighter Try average fish lightness 19

Fish lightness as discriminating feature Count number of bass and salmon at each lightness:
Lightness: 1, 2, 3, 4, 5
Bass:      0, 1, 2, 10, 12
Salmon:    6, 10, 6, 1, 0
(histogram: count vs. lightness for salmon and sea bass) Now the fish are well separated at a lightness threshold of 3.5, with a classification error of 8% 20

Can do even better by combining features: use both length and lightness. Feature vector [length, lightness]. (scatter plot in the lightness-length plane: a decision boundary separates the bass and salmon decision regions) 21

Better decision boundary (in the lightness-length plane): the ideal decision boundary gives 0% classification error 22

Test Classifier on New Data (in the lightness-length plane). The classifier should perform well on new data. Testing the ideal classifier on new data gives 25% error 23

What Went Wrong? Poor generalization from a complicated boundary. Complicated boundaries do not generalize well to new data; they are too tuned to the particular training data rather than to some true model that would separate salmon from sea bass well. This is called overfitting the data 24

Generalization (figure: the same boundaries shown on training data and on testing data). A simpler decision boundary does not perform ideally on the training data but generalizes better on new data. Favor simpler classifiers. William of Occam (1284-1347): entities are not to be multiplied without necessity 25

Pattern Recognition System Structure: input -> sensing -> segmentation -> feature extraction -> classification -> post-processing -> decision. Sensing (domain dependent): camera, microphones, medical imaging devices, etc. Segmentation: patterns should be well separated and should not overlap. Feature extraction: extract discriminating features; good features make the work of the classifier easy. Classification: use features to assign the object to a category; a better classifier makes feature extraction easier; our main topic in this course. Post-processing: exploit context (input-dependent information) to improve system performance, e.g. correcting "Tne cat" to "The cat" 26

How to design a PR system? Design cycle: start -> collect data (guided by prior knowledge) -> choose features -> choose model -> train classifier -> evaluate classifier -> end 27

Design Cycle cont. Collect data: can be quite costly. How do we know when we have collected an adequately representative set of testing and training examples? 28

Design Cycle cont. Choose features: should be discriminating, i.e. similar for objects in the same category and different for objects in different categories (figure: well-separated classes illustrate good features, overlapping classes illustrate bad features). Prior knowledge plays a great role (domain dependent). Features should be easy to extract and insensitive to noise and irrelevant transformations 29

Design Cycle cont. Choose model: What type of classifier to use? When should we try to reject one model and try another one? What is the best classifier for the problem? 30

Design Cycle cont. Train classifier: the process of using data to determine the parameters of the classifier, i.e. changing the parameters of the chosen model so that the model fits the collected data. There are many different procedures for training classifiers; this is the main scope of the course 31

Design Cycle cont. Evaluate classifier: measure system performance and identify the need for improvements in system components. How to adjust the complexity of the model to avoid overfitting? Are there any principled methods to do this? Trade-off between computational complexity and performance 32

Conclusion: pattern recognition is useful, with a lot of exciting and important applications, but hard; many issues must be solved to build a successful pattern recognition system 33

Review: mostly probability and some statistics 34

Content Probability Axioms and properties Conditional probability and independence Law of Total probability and Bayes theorem Random Variables Discrete Continuous Pairs of Random Variables Random Vectors Gaussian Random Variable 35

Basics We are performing a random experiment (catching one fish from the sea). S: all fish in the sea. An event A is a subset of S; total number of events: $2^{12}$ (diagram: the probability P maps each event A in S to a number P(A))

Axioms of Probability 1. $P(A) \ge 0$ 2. $P(S) = 1$ 3. If $A \cap B = \emptyset$ then $P(A \cup B) = P(A) + P(B)$ 37

Properties of Probability $P(\emptyset) = 0$; $P(A) \le 1$; $P(A^c) = 1 - P(A)$; $A \subset B \Rightarrow P(A) \le P(B)$; $P(A \cup B) = P(A) + P(B) - P(A \cap B)$; if $\{A_k\}$ satisfy $A_i \cap A_j = \emptyset$ for all $i \ne j$, then $P\left(\bigcup_{k=1}^{N} A_k\right) = \sum_{k=1}^{N} P(A_k)$ 38

Conditional Probability If A and B are two events, and we know that event B has occurred, then (if P(B) > 0) $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ (Venn diagram: once B has occurred, B acts as the new sample space). Multiplication rule: $P(A \cap B) = P(A \mid B)\,P(B)$ 39

Independence A and B are independent events if $P(A \cap B) = P(A)\,P(B)$. By the law of conditional probability, if A and B are independent, $P(A \mid B) = \frac{P(A)\,P(B)}{P(B)} = P(A)$. If two events are not independent, then they are said to be dependent 40

Law of Total Probability Let $B_1, B_2, \dots, B_n$ partition S. Consider an event A. Then $A = (A \cap B_1) \cup (A \cap B_2) \cup \dots \cup (A \cap B_n)$, thus $P(A) = P(A \cap B_1) + P(A \cap B_2) + \dots + P(A \cap B_n)$. Or, using the multiplication rule, $P(A) = P(A \mid B_1)P(B_1) + \dots + P(A \mid B_n)P(B_n) = \sum_{k=1}^{n} P(A \mid B_k)\,P(B_k)$

Bayes Theorem Let $B_1, B_2, \dots, B_n$ be a partition of the sample space S. Suppose event A occurs. What is the probability of event $B_i$? Answer: Bayes rule $P(B_i \mid A) = \frac{P(B_i \cap A)}{P(A)} = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{k=1}^{n} P(A \mid B_k)\,P(B_k)}$ 42
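
As a small illustration of the last two slides, here is a minimal sketch (the numbers are made up, not from the course) that computes P(A) by the law of total probability and then the posteriors P(B_i | A) by Bayes rule.

```python
# Law of total probability and Bayes rule for a finite partition B_1, ..., B_n.
prior = [0.5, 0.3, 0.2]            # P(B_k), must sum to 1 (illustrative values)
likelihood = [0.9, 0.5, 0.1]       # P(A | B_k) (illustrative values)

# Total probability: P(A) = sum_k P(A | B_k) P(B_k)
p_a = sum(l * p for l, p in zip(likelihood, prior))

# Bayes rule: P(B_i | A) = P(A | B_i) P(B_i) / P(A)
posterior = [l * p / p_a for l, p in zip(likelihood, prior)]
print(p_a, posterior)              # the posterior probabilities sum to 1
```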

Random Variables In a random experiment we usually assign some number to the outcome, for example the number of fish fins. A random variable X is a function from the sample space S to the real numbers. X is random due to the randomness of its argument (the number of fins): $P(X = a) = P(X(\omega) = a) = P(\{\omega \in \Omega : X(\omega) = a\})$

Two Types of Random Variables A discrete random variable has a countable number of values, e.g. number of fish fins (0, 1, 2, ..., 30). A continuous random variable takes values in a continuum, e.g. fish weight (any real number between 0 and 100)

Cumulative Distribution Function Given a random variable X, the CDF is defined as $F(a) = P(X \le a)$

Properties of CDF 1. F(a) is non-decreasing 2. $\lim_{b \to \infty} F(b) = 1$ 3. $\lim_{b \to -\infty} F(b) = 0$. Questions about X can be asked in terms of the CDF: $P(a < X \le b) = F(b) - F(a)$. Example: $P(20 < X \le 30) = F(30) - F(20)$
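
A minimal sketch of the CDF property above; the choice of distribution (fish weight taken as normal with mean 25 and standard deviation 5) is purely an assumption for illustration.

```python
# P(20 < X <= 30) = F(30) - F(20) for an assumed weight distribution.
from scipy.stats import norm

weight = norm(loc=25, scale=5)       # hypothetical fish-weight distribution
p = weight.cdf(30) - weight.cdf(20)  # probability the weight falls in (20, 30]
print(p)                             # about 0.68 for these parameters
```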

Discrete RV: Probability Mass Function Given a discrete random variable X, we define the probability mass function as $p(a) = P(X = a)$. Satisfies all axioms of probability. The CDF in the discrete case satisfies $F(a) = P(X \le a) = \sum_{x \le a} P(X = x) = \sum_{x \le a} p(x)$ 47

Continuous RV: Probability Density Function Given a continuous RV X, we say f(x) is its probability density function if $F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$ and, more generally, $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$ 48

Properties of Probability Density Function $P(X = a) = \int_{a}^{a} f(x)\,dx = 0$; $P(-\infty < X < \infty) = \int_{-\infty}^{\infty} f(x)\,dx = 1$; $\frac{d}{dx} F(x) = f(x)$; $f(x) \ge 0$ 49

Probability mass vs. probability density (figure: a pmf over fin counts 1 to 5, with p(2) = 0.3 and p(3) = 0.4, and a pdf over fish weight with density value about 0.6 near 30 kg). A pmf gives true probability, so we take sums: P(fish has 2 or 3 fins) = p(2) + p(3) = 0.3 + 0.4. A pdf gives density, not probability: the density near 30 kg may be 0.6, yet P(fish weighs exactly 30 kg) = 0; instead we integrate: $P(\text{fish weighs between 29 and 31 kg}) = \int_{29}^{31} f(x)\,dx$
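
A minimal sketch of the sum-vs-integrate distinction above. The full pmf and the weight density below are assumptions made only so there is something concrete to sum and integrate; only p(2) = 0.3 and p(3) = 0.4 come from the slide.

```python
# Discrete: sum the pmf.  Continuous: integrate the pdf.
from scipy.stats import norm
from scipy.integrate import quad

pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.1, 5: 0.1}   # assumed pmf for number of fins
p_2_or_3 = pmf[2] + pmf[3]                        # P(2 or 3 fins) = 0.7

f = norm(loc=30, scale=5).pdf                     # assumed density for fish weight
p_29_to_31, _ = quad(f, 29, 31)                   # P(29 <= weight <= 31), by integration
print(p_2_or_3, p_29_to_31)                       # note P(weight == 30) is exactly 0
```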

Expected Value Useful characterization of a r.v. Also known as mean, expectation, or first moment. Discrete case: $\mu = E(X) = \sum_{x} x\,p(x)$; continuous case: $\mu = E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx$. Expectation can be thought of as the average or the center, or the expected average outcome over many experiments 51

Expected Value for Functions of X Let g(x) be a function of the r.v. X. Then, discrete case: $E[g(X)] = \sum_{x} g(x)\,p(x)$; continuous case: $E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx$. An important function of X is $[X - E(X)]^2$: its expectation is the variance, $E[(X - E(X))^2] = \mathrm{var}(X) = \sigma^2$. Variance measures the spread around the mean. Standard deviation $= [\mathrm{var}(X)]^{1/2}$ has the same units as the r.v. X 52
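
To make the formulas concrete, here is a minimal sketch computing E[X], E[g(X)] and var(X) for a discrete pmf; the pmf is the same illustrative fin-count example assumed above, not data from the course.

```python
# E[X], E[g(X)] with g(x) = x^2, and var(X) for a discrete pmf.
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.1, 5: 0.1}            # assumed pmf

mean = sum(x * p for x, p in pmf.items())                  # E[X]
e_x2 = sum(x ** 2 * p for x, p in pmf.items())             # E[X^2]
var  = sum((x - mean) ** 2 * p for x, p in pmf.items())    # E[(X - E[X])^2]
std  = var ** 0.5                                          # same units as X
print(mean, e_x2, var, std)                                # var also equals E[X^2] - E[X]^2
```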

Properties of Expectation If X is a constant r.v., X = c, then E(X) = c. If a and b are constants, E(aX + b) = aE(X) + b. More generally, $E\left(\sum_{i=1}^{n} (a_i X_i + c_i)\right) = \sum_{i=1}^{n} \left(a_i E(X_i) + c_i\right)$. If a and b are constants, then $\mathrm{var}(aX + b) = a^2\,\mathrm{var}(X)$ 53

Pairs of Random Variables Say we have 2 random variables: fish weight X and fish lightness Y. Can define the joint CDF $F(a, b) = P(X \le a, Y \le b) = P(\{\omega \in \Omega : X(\omega) \le a, Y(\omega) \le b\})$. Similar to the single-variable case, can define, discrete: joint probability mass function $p(a, b) = P(X = a, Y = b)$; continuous: joint density function $f(x, y)$, with $P(a \le X \le b,\ c \le Y \le d) = \int_{a}^{b} \int_{c}^{d} f(x, y)\,dy\,dx$ 54

Marginal Distributions Given the joint mass function $p_{X,Y}(a, b)$, the marginal, i.e. the probability mass function for r.v. X, can be obtained from $p_{X,Y}(a, b)$: $p_X(a) = \sum_{y} p_{X,Y}(a, y)$, and likewise $p_Y(b) = \sum_{x} p_{X,Y}(x, b)$. Marginal densities $f_X(x)$ and $f_Y(y)$ are obtained from the joint density $f_{X,Y}(x, y)$ by integrating: $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy$, $f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$ 55
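
A minimal sketch of marginalization for a discrete joint pmf; the 3x3 table of joint probabilities below is invented for illustration.

```python
# Marginal pmfs from a joint pmf p(x, y): sum out the other variable.
import numpy as np

p_xy = np.array([[0.10, 0.05, 0.05],    # rows index values of X
                 [0.10, 0.20, 0.10],    # columns index values of Y
                 [0.05, 0.15, 0.20]])   # entries sum to 1

p_x = p_xy.sum(axis=1)                  # p_X(a) = sum over y of p(a, y)
p_y = p_xy.sum(axis=0)                  # p_Y(b) = sum over x of p(x, b)
print(p_x, p_y)                         # each marginal sums to 1
```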

Independence of Random Variables r.v. X and Y are independent if $P(X \le x, Y \le y) = P(X \le x)\,P(Y \le y)$. Theorem: r.v. X and Y are independent if and only if $p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$ (discrete), $f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$ (continuous) 56

More on Independent RVs If X and Y are independent, then E(XY) = E(X)E(Y), Var(X+Y) = Var(X) + Var(Y), and g(X) and h(Y) are independent for any functions g and h 57

Covariance Given r.v. X and Y, covariance is defined as $\mathrm{cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y)$. Covariance is useful for checking if features X and Y give similar information. Covariance (from co-vary) indicates the tendency of X and Y to vary together: if X and Y tend to increase together, cov(X,Y) > 0; if X tends to decrease when Y increases, cov(X,Y) < 0; if a decrease (increase) in X does not predict the behavior of Y, cov(X,Y) is close to 0 58

Covariance and Correlation If cov(X,Y) = 0, then X and Y are said to be uncorrelated (think unrelated); however, X and Y are not necessarily independent. If X and Y are independent, cov(X,Y) = 0. Can normalize covariance to get correlation: $-1 \le \mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}} \le 1$ 59

Random Vectors Generalize from pairs of r.v. to a vector of r.v. $X = [X_1\ X_2\ X_3]$ (think multiple features). Joint CDF, PDF, PMF are defined similarly to the case of a pair of r.v.'s. Example: $F(x_1, x_2, \dots, x_n) = P(X_1 \le x_1, X_2 \le x_2, \dots, X_n \le x_n)$. All the properties of expectation, variance, covariance transfer with suitable modifications 60

Covariance Matrix A characteristic summary of a random vector: $\mathrm{cov}(X) = \mathrm{cov}[X_1\ X_2 \cdots X_n] = \Sigma = E[(X - \mu)(X - \mu)^T]$, the $n \times n$ matrix whose $(i, j)$ entry is $E[(X_i - \mu_i)(X_j - \mu_j)]$; the diagonal entries are the variances $\sigma_1^2, \dots, \sigma_n^2$ and the off-diagonal entries $c_{ij}$ are the covariances 61
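
The covariance matrix and the correlation of the previous two slides can be estimated from samples; below is a minimal sketch on synthetic data (the two correlated "features" are invented purely for illustration).

```python
# Sample covariance matrix and correlation matrix of a 2-dimensional random vector.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
length = rng.normal(10.0, 2.0, n)                   # feature 1
lightness = 0.5 * length + rng.normal(0.0, 1.0, n)  # feature 2, correlated with feature 1

X = np.vstack([length, lightness])   # rows = variables, columns = samples
print(np.cov(X))                     # 2x2 covariance matrix (diagonal: variances)
print(np.corrcoef(X))                # correlations, each entry between -1 and 1
```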

Normal or Gaussian Random Variable Has density $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$. Mean $\mu$ and variance $\sigma^2$ 62

Multivariate Gaussian Has density $f(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$, with mean vector $\mu = [\mu_1, \dots, \mu_n]$ and covariance matrix $\Sigma$ 63
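
A minimal sketch evaluating the multivariate Gaussian density above; the mean vector and covariance matrix are arbitrary illustrative values, not parameters from the course.

```python
# Evaluate the multivariate Gaussian density at a couple of points.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([10.0, 5.0])              # mean vector
Sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])          # covariance matrix (symmetric, positive definite)

f = multivariate_normal(mean=mu, cov=Sigma)
print(f.pdf(mu))                        # density is largest at the mean
print(f.pdf([12.0, 6.0]))               # smaller away from the mean
```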

Why Gaussian? Frequently observed (central limit theorem). Parameters $\mu$ and $\Sigma$ are sufficient to characterize the distribution. Nice to work with: marginal and conditional distributions are also Gaussian, and if the $X_i$ are uncorrelated then they are also independent 64

Summary Intro to Pattern Recognition Review of Probability and Statistics Next time will review linear algebra 65