INTRODUCTION TO PATTERN RECOGNITION
INSTRUCTOR: WEI DING

Pattern Recognition
The automatic discovery of regularities in data through the use of computer algorithms, and the use of these regularities to take actions such as classifying the data into different categories.

Example: Handwritten Digit Recognition
Each digit corresponds to a 28 by 28 pixel image. How would you formulate the problem?

Example: Problem Formulation
Each digit corresponds to a 28×28 pixel image and so can be represented by a vector x comprising 784 real numbers. The goal is to build a machine that takes such a vector x as input and produces the identity of the digit 0, ..., 9 as output. This is a nontrivial problem because of the wide variability of handwriting. Could we use handcrafted rules, or heuristics that distinguish the digits based on the shapes of the strokes?
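As a small illustration of the representation described above, the sketch below flattens a 28×28 pixel grid into a single input vector. The image here is random placeholder data, not a real digit, and the names `image` and `x` are chosen only for this example.

```python
import numpy as np

# Hypothetical stand-in for one grayscale digit image (28 rows x 28 columns).
rng = np.random.default_rng(0)
image = rng.random((28, 28))

# Flatten the pixel grid row by row into a single input vector x.
x = image.reshape(-1)

print(x.shape)  # (784,)
```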

Example: Training Set, Test Set & Generalization
Training set: the categories of the digits in the training set are known in advance. We can express the category of a digit using a target vector t, which represents the identity of the corresponding digit.
Test set: new, previously unseen digit images.
Generalization: a machine learning approach learns a function y(x) that correctly categorizes new examples that differ from those used for training. In practical applications, the variability of the input vectors is such that the training data can comprise only a tiny fraction of all possible input vectors, so generalization is a central goal in pattern recognition.

Feature Extraction
Find useful features that are fast to compute and yet preserve useful discriminatory information. Feature extraction is a form of dimensionality reduction.

Supervised Learning: Classification & Regression
Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems. Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.

Polynomial Curve Fitting (Regression)
In real data sets, individual observations are often corrupted by random noise.
[Figure: plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data.]
Goal: to predict the value of t for some new value of x, without knowledge of the green curve.

Polynomial Curve Fitting
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$$
M is the order of the polynomial, and $x^j$ denotes x raised to the power of j. The polynomial coefficients $w_0, \dots, w_M$ are collectively denoted by the vector w. Although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w. Functions that are linear in the unknown parameters are called linear models.

Error Function
The values of the coefficients are determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function y(x, w), for any given value of w, and the training set data points. A simple and widely used choice is the sum of the squares of the errors between the predictions $y(x_n, \mathbf{w})$ for each data point $x_n$ and the corresponding target values $t_n$.
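The polynomial model and the sum-of-squares misfit described above can be written down in a few lines. This is a minimal sketch; the function names `poly` and `sum_of_squares_error` are chosen here for illustration, and the factor 1/2 is the conventional one for this error function.

```python
import numpy as np

def poly(x, w):
    """Evaluate y(x, w) = sum_j w[j] * x**j for a scalar or array x."""
    x = np.asarray(x, dtype=float)
    powers = np.arange(len(w))
    # x[..., None] ** powers has one column per power of x.
    return (x[..., None] ** powers) @ np.asarray(w, dtype=float)

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    residuals = poly(x, w) - np.asarray(t, dtype=float)
    return 0.5 * np.sum(residuals ** 2)

# y(2, w) with w = (1, 2, 3) is 1 + 2*2 + 3*4 = 17.
print(poly(2.0, [1, 2, 3]))
```

Note that `poly` is nonlinear in `x` but linear in `w`: doubling every coefficient doubles the output.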

Sum-of-Squares Error Function
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$$
When is E(w) = 0?

Solving the Curve Fitting Problem
We choose the value of w for which E(w) is as small as possible. Because the error function is a quadratic function of the coefficients w, its derivatives with respect to the coefficients are linear in the elements of w, and so the minimization of the error function has a unique solution, denoted by w*, which can be found in closed form. The resulting polynomial is given by the function y(x, w*).
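The minimization described above is an ordinary linear least-squares problem, so it can be sketched with a standard solver. The code below assumes Gaussian noise with standard deviation 0.3 and evenly spaced inputs, neither of which is stated in the slides, and it uses `np.linalg.lstsq` rather than writing out the closed-form normal equations by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: N = 10 noisy observations of sin(2*pi*x).
# ASSUMPTION: evenly spaced x and noise standard deviation 0.3.
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

def fit_polynomial(x, t, M):
    """Return w* minimizing the sum-of-squares error for order M."""
    # Column j of the design matrix holds x**j, for j = 0..M.
    A = np.vander(x, M + 1, increasing=True)
    w_star, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w_star

w_star = fit_polynomial(x, t, M=3)
print(w_star)
```

At the minimum the gradient of E(w) vanishes, which for this model means `A.T @ (A @ w_star - t)` is zero up to rounding error.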

0th Order Polynomial
[Figure: M = 0 fit] Poor fit.

1st Order Polynomial
[Figure: M = 1 fit] Poor fit.

3rd Order Polynomial
[Figure: M = 3 fit] Excellent fit.

9th Order Polynomial
[Figure: M = 9 fit] Overfitting. The polynomial passes exactly through each data point and E(w*) = 0. However, the fitted curve oscillates wildly and gives a very poor representation of the function sin(2πx).

Root Mean Square (RMS) Error
$$E_{RMS} = \sqrt{2 E(\mathbf{w}^*) / N}$$
For each choice of M, we can evaluate the RMS error on the test data set. The division by N allows us to compare data sets of different sizes on an equal footing, and the square root ensures that $E_{RMS}$ is measured on the same scale (and in the same units) as the target variable t.

Overfitting
[Figure: graphs of the training and test set RMS errors for various values of M.]
The test set error is a measure of how well we are doing in predicting the values of t for new data observations of x. Values of M in the range 3 <= M <= 8 give small values for the test set error, and these also give reasonable representations of the generating function sin(2πx).
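The comparison of training and test RMS error across polynomial orders can be reproduced roughly as follows. This is a sketch on synthetic data: the noise level, the test set size of 100, and the random seed are assumptions, so the printed numbers will not match the slide's figure exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD = 0.3  # ASSUMPTION: noise level is not given in the slides

def make_data(n):
    """Noisy samples of sin(2*pi*x) at n evenly spaced points in [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=NOISE_STD, size=n)
    return x, t

def fit(x, t, M):
    """Least-squares coefficients of an order-M polynomial."""
    A = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), i.e. the root of the mean squared residual."""
    A = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((A @ w - t) ** 2))

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)  # ASSUMPTION: test set size

train_rms, test_rms = [], []
for M in range(10):
    w = fit(x_train, t_train, M)
    train_rms.append(rms_error(w, x_train, t_train))
    test_rms.append(rms_error(w, x_test, t_test))
    print(f"M={M}  train={train_rms[-1]:.3f}  test={test_rms[-1]:.3f}")
```

The training error can only decrease as M grows, because each polynomial contains all lower-order ones as special cases; the test error is the quantity that reveals overfitting.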

Overfitting
For M = 9, the training set error goes to zero, because this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients $w_0, \dots, w_9$, and so can be tuned exactly to the 10 data points in the training set. However, the test set error has become very large and the corresponding function y(x, w*) exhibits wild oscillations.

Paradox
A polynomial of a given order contains all lower order polynomials as special cases. The M = 9 polynomial is therefore capable of generating results at least as good as the M = 3 polynomial. The best predictor of new data would be the function sin(2πx) itself. We know that a power series expansion of sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M.

Polynomial Coefficients
The more flexible polynomials with larger values of M become increasingly tuned to the random noise on the target values.

It is also interesting to examine the behavior of a given model as the size of the data set is varied.

Data Set Size: 9th Order Polynomial
[Figures: fits of the 9th order polynomial as the size of the data set is varied.]
So what is your conclusion?

Overfitting vs. Size of the Data Set
For a given model complexity, the overfitting problem becomes less severe as the size of the data set increases. The larger the data set, the more complex (in other words, more flexible) the model that we can afford to fit to the data. One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model, though this is not necessarily so. There is something rather unsatisfying about having to limit the number of parameters in a model according to the size of the available training set. It would seem more reasonable to choose the complexity of the model according to the complexity of the problem being solved.

Overfitting (later in this semester)
Maximum likelihood: the overfitting problem can be understood as a general property of maximum likelihood.
Bayesian approach: there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. By adopting a Bayesian approach, the overfitting problem can be avoided.

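For the moment, it is instructive to continue with the current approach and to consider how in practice we can apply it to data sets of limited size where we may wish to use relatively complex and flexible models.

Regularization
Penalize large coefficient values by adding a penalty term to the error function that discourages the coefficients from reaching large values:
$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
where $y(x_n, \mathbf{w})$ are the predictions, $t_n$ is the ground truth, and $\|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \dots + w_M^2$. The coefficient λ governs the relative importance of the regularization term compared with the sum-of-squares error term.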
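Because the penalty is quadratic in w, the regularized error can still be minimized in closed form: setting its gradient to zero gives the linear system (AᵀA + λI) w = Aᵀt, where A is the matrix of powers of the inputs. The sketch below solves that system for an M = 9 polynomial. The data, the noise level, and the two values of λ are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic data, 10 noisy samples of sin(2*pi*x), noise std 0.3.
N, M = 10, 9
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
A = np.vander(x, M + 1, increasing=True)  # column j holds x**j

def fit_regularized(A, t, lam):
    """Minimize 1/2 * ||A w - t||^2 + lam/2 * ||w||^2 (requires lam > 0 here)."""
    n_params = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n_params), A.T @ t)

# Two illustrative regularization strengths (arbitrary choices).
lam_small, lam_large = np.exp(-18.0), 1.0
w_small = fit_regularized(A, t, lam_small)
w_large = fit_regularized(A, t, lam_large)

print("||w|| with small lambda:", np.linalg.norm(w_small))
print("||w|| with large lambda:", np.linalg.norm(w_large))
```

A larger λ shrinks the coefficient vector, which is the effect the penalty term is designed to have.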

Regularization: M = 9
[Figures: regularized fits of the M = 9 polynomial for different values of λ.]

Polynomial Coefficients
The regularization has the desired effect of reducing the magnitude of the coefficients.

Regularization: RMS Error vs. ln λ
The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error for both training and test sets against ln λ. In effect, λ now controls the effective complexity of the model and hence determines the degree of overfitting.

Model Complexity
To solve a practical application using this approach of minimizing an error function, we would have to find a way to determine a suitable value for the model complexity. What we have discussed suggests a simple way of achieving this: take the available data and partition it into a training set, used to determine the coefficients w, and a separate validation set, also called a hold-out set, used to optimize the model complexity (either M or λ). In many cases, however, this is too wasteful of valuable training data.

Probability Theory
A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements and through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition.
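Returning to the hold-out idea under Model Complexity above, the procedure can be sketched as follows: fit on the training part only, then pick the polynomial order with the lowest error on the validation part. The data set size of 30, the 20/10 split, and the noise level are all assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: 30 noisy samples of sin(2*pi*x), noise std 0.3.
N = 30
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Partition the available data into a training set and a hold-out set.
idx = rng.permutation(N)
train_idx, val_idx = idx[:20], idx[20:]

def fit(x, t, M):
    """Least-squares coefficients of an order-M polynomial."""
    A = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def rms_error(w, x, t):
    """Root mean square error of the polynomial with coefficients w."""
    A = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((A @ w - t) ** 2))

# The training set determines w; the validation set selects M.
val_rms = {}
for M in range(10):
    w = fit(x[train_idx], t[train_idx], M)
    val_rms[M] = rms_error(w, x[val_idx], t[val_idx])

best_M = min(val_rms, key=val_rms.get)
print("selected order M =", best_M)
```

The ten validation points here are never used for fitting, which is exactly the cost the slide calls wasteful when data is scarce.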

Probability Theory (example 1/4): Apples and Oranges
We randomly pick one of the boxes, and from that box we randomly select an item of fruit. Having observed which sort of fruit it is, we replace it in the box from which it came. Let us suppose that in so doing we pick the red box 40% of the time and the blue box 60% of the time, and that when we remove an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

Random Variables (example 2/4)
The identity of the box that will be chosen is a random variable, which we shall denote by B. This random variable can take one of two possible values, namely r (red box) or b (blue box). The identity of the fruit is also a random variable and will be denoted by F. It can take either of the values a (for apple) or o (for orange).

Example 3/4
We shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. The probability of selecting the red box is 4/10 and the probability of selecting the blue box is 6/10: p(B=r) = 4/10 and p(B=b) = 6/10. By definition, probabilities must lie in the interval [0, 1]. If the events are mutually exclusive and if they include all possible outcomes, then the probabilities for those events must sum to one.

Example 4/4
What is the overall probability that the selection procedure will pick an apple? Given that we have chosen an orange, what is the probability that the box we chose was the blue one? We need to learn the sum rule and the product rule in order to answer those questions.
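The frequency definition of probability in Example 3/4 can be illustrated with a short simulation: the fraction of trials in which the red box is picked approaches 4/10 as the number of trials grows. The function name `red_box_fraction` and the random seed are choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
P_RED = 0.4  # p(B=r) from the example; p(B=b) is then 0.6

def red_box_fraction(num_trials):
    """Fraction of simulated trials in which the red box is picked."""
    picked_red = rng.random(num_trials) < P_RED
    return picked_red.mean()

for n in (100, 10_000, 1_000_000):
    print(n, red_box_fraction(n))
```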

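Probability Theory
Consider two random variables X and Y and a total number N of instances of these variables. Let $n_{ij}$ be the number of instances for which X = $x_i$ and Y = $y_j$. The number of points in column i, corresponding to X = $x_i$, is denoted by $c_i$. The number of points in row j, corresponding to Y = $y_j$, is denoted by $r_j$.
Marginal probability: $p(X = x_i) = c_i / N$
Joint probability: $p(X = x_i, Y = y_j) = n_{ij} / N$
Conditional probability: $p(Y = y_j \mid X = x_i) = n_{ij} / c_i$
Here we are implicitly considering the limit N → ∞.

Probability Theory
Sum rule: $p(X = x_i) = \sum_j p(X = x_i, Y = y_j)$
Product rule: $p(X = x_i, Y = y_j) = p(Y = y_j \mid X = x_i) \, p(X = x_i)$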
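The counting definitions above can be checked numerically on a small table of counts. The table below is entirely hypothetical; it is laid out with one row per value of Y and one column per value of X, so that column totals play the role of the counts for X and row totals the counts for Y.

```python
import numpy as np

# HYPOTHETICAL counts: entry [j, i] is the number of instances
# with Y = y_j (row j) and X = x_i (column i).
counts = np.array([[3, 1, 4],
                   [2, 6, 4]])

N = counts.sum()
c = counts.sum(axis=0)  # column totals: instances with X = x_i
r = counts.sum(axis=1)  # row totals: instances with Y = y_j

joint = counts / N      # p(X = x_i, Y = y_j)
p_x = c / N             # marginal p(X = x_i)
p_y_given_x = counts / c  # p(Y = y_j | X = x_i), divides each column by c_i

# Sum rule: summing the joint over Y recovers the marginal of X.
print(np.allclose(joint.sum(axis=0), p_x))
# Product rule: conditional times marginal recovers the joint.
print(np.allclose(p_y_given_x * p_x, joint))
```

Both checks print `True`, since the rules follow directly from the counting definitions.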

The Rules of Probability
Sum rule: $p(X) = \sum_Y p(X, Y)$
Product rule: $p(X, Y) = p(Y \mid X) \, p(X)$
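With the two rules in hand, the questions from Example 4/4 can be worked through exactly. The slides as extracted here do not state how many apples and oranges each box holds, so the contents below are an assumption made purely for illustration; the answers change if the real contents differ. Only p(B=r) = 4/10 and p(B=b) = 6/10 come from the example.

```python
from fractions import Fraction

# From the example: prior probabilities of picking each box.
p_box = {"r": Fraction(4, 10), "b": Fraction(6, 10)}

# ASSUMPTION (not stated in the slides): fruit counts in each box.
contents = {
    "r": {"a": 2, "o": 6},  # red box: 2 apples, 6 oranges
    "b": {"a": 3, "o": 1},  # blue box: 3 apples, 1 orange
}

def p_fruit_given_box(fruit, box):
    """p(F = fruit | B = box): every piece in the box is equally likely."""
    return Fraction(contents[box][fruit], sum(contents[box].values()))

def p_fruit(fruit):
    """Sum rule over boxes, with each joint term from the product rule."""
    return sum(p_fruit_given_box(fruit, box) * p_box[box] for box in p_box)

def p_box_given_fruit(box, fruit):
    """Product rule applied in both directions, solved for p(B | F)."""
    return p_fruit_given_box(fruit, box) * p_box[box] / p_fruit(fruit)

print(p_fruit("a"))                 # 11/20
print(p_box_given_fruit("b", "o"))  # 1/3
```

Under these assumed contents, observing an orange makes the blue box less likely than its prior of 6/10, because oranges are much more common in the red box.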
