PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014
Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory
What is Machine Learning? Outline of this section 1 What is Machine Learning? Basic concepts 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory
What is Machine Learning? Basic concepts What is ML? Figure 1: Examples of handwritten digits Training set Target vector Test set
What is Machine Learning? Basic concepts 1 Generalization: The ability to correctly categorize new examples that differ from those used for training is known as generalization. 2 Supervised learning Regression Classification 3 Unsupervised learning Clustering Density estimation Projecting data from a high-dimensional space to a low-dimensional space
Curve Fitting Outline of this section 1 What is Machine Learning? 2 Curve Fitting An example Overfitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory
Curve Fitting An example An example (important) Figure 2: Examples of polynomial curve fitting Green curve - sin(2πx) Blue dots - the target values t_n
Curve Fitting An example The polynomial function is
y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M (1)
and the error function is
E(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 (2)
We can solve the curve fitting problem by choosing the value of w for which E(w) is as small as possible.
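As a concrete illustration (not from the slides), here is a minimal sketch of this least-squares fit in Python; the noisy sin(2πx) data, the noise level, and the variable names Phi and w are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                               # sample size and polynomial order
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix: column j holds x**j, so Phi @ w evaluates equation (1).
Phi = np.vander(x, M + 1, increasing=True)

# Minimizing E(w) in (2) is a linear least-squares problem in w.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

E = 0.5 * np.sum((Phi @ w - t) ** 2)       # equation (2)
print(w, E)
```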
Curve Fitting An example Figure 3: Curve fitting with different values of M
Curve Fitting An example Figure 4: Values of the coefficients w for different M
Curve Fitting Overfitting Why did overfitting happen? Figure 5: Overfitting The Taylor series of sin(2πx) has infinitely many terms. More flexible polynomials with larger values of M become increasingly tuned to the random noise on the target values.
Curve Fitting Overfitting How to estimate over-fitting? 1 Use the test set to evaluate E(w) in (2); 2 Use the root-mean-square (RMS) error:
E_{RMS} = \sqrt{2E(w^*)/N} (3)
The error is a measure of how well we are doing in predicting the values of t for new observations.
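A small helper implementing (3), assuming a design matrix Phi, targets t, and fitted coefficients w as in the previous sketch (hypothetical names, not from the slides):

```python
import numpy as np

def rms_error(Phi, t, w):
    """E_RMS = sqrt(2 E(w) / N); comparable across data sets of different size."""
    E = 0.5 * np.sum((Phi @ w - t) ** 2)   # equation (2)
    return np.sqrt(2 * E / len(t))         # equation (3)
```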
Curve Fitting Overfitting How to avoid/control over-fitting? I 1 Increasing the size of training data (N) Figure 6: More training data
Curve Fitting Overfitting How to avoid/control over-fitting? II 2 Regularization
\tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{\lambda}{2} \|w\|^2 (4)
where \|w\|^2 \equiv w^T w = w_0^2 + w_1^2 + \cdots + w_M^2, and λ governs the relative importance of the regularization term compared with the sum-of-squares error term.
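For a quadratic penalty the minimizer of (4) has a well-known closed form, w* = (Phi^T Phi + λI)^{-1} Phi^T t (standard ridge regression). A minimal sketch, reusing the hypothetical Phi and t from the earlier example:

```python
import numpy as np

def fit_regularized(Phi, t, lam):
    """Minimize the regularized error (4) in closed form."""
    n_coef = Phi.shape[1]                  # M + 1 coefficients
    A = Phi.T @ Phi + lam * np.eye(n_coef)
    return np.linalg.solve(A, Phi.T @ t)
```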
Curve Fitting Overfitting How to avoid/control over-fitting? III Figure 7: Different coefficients with different values of λ
Curve Fitting Overfitting How to avoid/control over-fitting? IV Figure 8: Curves with different values of λ 3 Choosing a proper value for model complexity 4 Adopting a Bayesian approach
Probability Theory Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory The rules of probability Bayesian probabilities Curve fitting re-visited 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory
Probability Theory The rules of probability The rules of probability 1 Sum rule
p(X) = \sum_Y p(X, Y) (5)
2 Product rule
p(X, Y) = p(Y|X) p(X) (6)
3 Bayes' theorem
p(Y|X) = \frac{p(X|Y) p(Y)}{p(X)} (7)
Plays a central role in pattern recognition and machine learning.
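A numerical check of (5)-(7) on a small discrete joint distribution (the 2x3 table below is made up for illustration):

```python
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.05],       # p(X=i, Y=j), sums to 1
                 [0.15, 0.30, 0.20]])

p_x = p_xy.sum(axis=1)                     # sum rule (5): p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)
p_y_given_x = p_xy / p_x[:, None]          # product rule (6) rearranged
p_x_given_y = p_xy / p_y[None, :]

# Bayes' theorem (7): p(Y|X) = p(X|Y) p(Y) / p(X)
bayes = p_x_given_y * p_y[None, :] / p_x[:, None]
assert np.allclose(bayes, p_y_given_x)
```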
Probability Theory Bayesian probabilities Bayesian probabilities Now we turn to a more general Bayesian view. Probabilities provide a quantification of uncertainty, e.g. the uncertainty over whether the polar ice cap will have melted by the end of the century, an event that cannot be repeated and so has no frequentist interpretation.
Probability Theory Bayesian probabilities Insight into Bayes' theorem prior probability - p(w) observed data - D = \{t_1, t_2, ..., t_N\}
p(w|D) = \frac{p(D|w) p(w)}{p(D)} (8)
which converts the prior probability p(w) into the posterior probability p(w|D). Bayes' theorem can be stated in words as
posterior ∝ likelihood × prior
Probability Theory Curve fitting re-visited Curve fitting re-visited I Here we return to the example of curve fitting, and gain some insights into the error function and regularization. training data - x = (x_1, x_2, ..., x_N)^T target values - t = (t_1, t_2, ..., t_N)^T We assume t has a Gaussian distribution with mean equal to the value y(x, w), thus we have
p(t|x, w, β) = \mathcal{N}(t \,|\, y(x, w), β^{-1}) (9)
where β is the inverse variance of the distribution.
Probability Theory Curve fitting re-visited Curve fitting re-visited II Figure 9: Distribution of t
Probability Theory Curve fitting re-visited Curve fitting re-visited III We now use the training data \{x, t\} to determine the values of the unknown parameters w and β by maximum likelihood.
p(t|x, w, β) = \prod_{n=1}^{N} \mathcal{N}(t_n \,|\, y(x_n, w), β^{-1}) (10)
By maximizing the logarithm of the likelihood,
\ln p(t|x, w, β) = -\frac{β}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{N}{2} \ln β - \frac{N}{2} \ln(2π) (11)
we can get the corresponding polynomial coefficients w_{ML}. β can also be obtained from
\frac{1}{β_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{y(x_n, w_{ML}) - t_n\}^2 (12)
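A sketch of how w_ML and β_ML in (11)-(12) could be computed, reusing the hypothetical design matrix Phi and targets t from the earlier curve-fitting sketch:

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum-likelihood w (least squares) and beta from equation (12)."""
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    inv_beta = np.mean((Phi @ w_ml - t) ** 2)   # 1 / beta_ML, equation (12)
    return w_ml, 1.0 / inv_beta
```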
Probability Theory Curve fitting re-visited Curve fitting re-visited IV The first term of equation (11) has the same form as the sum-of-squares error in equation (2). A step towards the Bayesian approach Consider a Gaussian prior distribution
p(w|α) = \mathcal{N}(w \,|\, 0, α^{-1} I) = \left(\frac{α}{2π}\right)^{(M+1)/2} \exp\left\{-\frac{α}{2} w^T w\right\} (13)
So we have
p(w|x, t, α, β) ∝ p(t|x, w, β) \, p(w|α) (14)
Maximizing the posterior is equivalent to minimizing
\frac{β}{2} \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2 + \frac{α}{2} w^T w (15)
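Dividing (15) by β shows that maximizing this posterior is the same as minimizing the regularized error (4) with λ = α/β. A minimal sketch under that correspondence, again reusing the hypothetical Phi and t:

```python
import numpy as np

def fit_map(Phi, t, alpha, beta):
    """MAP estimate of w: ridge regression with lambda = alpha / beta."""
    lam = alpha / beta                     # regularization implied by the prior
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)
```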
Probability Theory Curve fitting re-visited Bayesian curve fitting Will be introduced in Section 3.3.
Model Selection Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory
Model Selection Model Selection I If data is plentiful, we can set aside an independent validation set to compare models; otherwise, we can use cross-validation. Figure 10: Cross-validation procedure
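A minimal S-fold cross-validation sketch matching Figure 10; the fit/score callables are assumed to have the signatures of the earlier hypothetical helpers (e.g. fit_ml and rms_error):

```python
import numpy as np

def cross_validate(Phi, t, fit, score, S=4, seed=0):
    """Average validation error over S folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), S)
    errors = []
    for s in range(S):
        val = folds[s]
        train = np.concatenate([folds[j] for j in range(S) if j != s])
        w = fit(Phi[train], t[train])              # train on S-1 folds
        errors.append(score(Phi[val], t[val], w))  # evaluate on held-out fold
    return np.mean(errors)
```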
Model Selection Model Selection II To correct for the bias of maximum likelihood, we introduce the Akaike information criterion (AIC): choose the model for which the quantity
\ln p(D|w_{ML}) - M (16)
is largest, where M is the number of adjustable parameters in the model.
The curse of dimensionality Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory
The curse of dimensionality The curse of dimensionality I Figure 11: Data
The curse of dimensionality The curse of dimensionality II Figure 12: Result
The curse of dimensionality The curse of dimensionality III Figure 13: The curse of dimensionality
The curse of dimensionality The curse of dimensionality IV The volume of a sphere of radius r in D dimensions scales as
V_D(r) = K_D r^D (17)
so the fraction of the volume lying in a thin shell between radii 1 - ε and 1 is
\frac{V_D(1) - V_D(1-ε)}{V_D(1)} = 1 - (1-ε)^D (18)
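A quick numerical check of (18), showing how fast the shell fraction grows with D (the particular ε and dimensions below are arbitrary choices for illustration):

```python
eps = 0.01
for D in (1, 2, 20, 200):
    frac = 1 - (1 - eps) ** D              # equation (18)
    print(f"D={D:4d}  shell fraction = {frac:.3f}")
# For D=200, even a 1%-thick shell already holds ~87% of the volume.
```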
The curse of dimensionality The curse of dimensionality V Figure 14: The mass distribution Not all intuitions developed in spaces of low dimensionality will generalize to spaces of high dimensionality.
The curse of dimensionality Why can we still apply effective techniques to high-dimensional spaces? 1 Real data will often be confined to a region of the space having lower effective dimensionality 2 Real data will typically exhibit some smoothness properties, so that for the most part small changes in the input variables will produce small changes in the target variables.
Decision theory Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory Minimizing the misclassification rate Minimizing the expected loss The reject option Inference & decision 7 Information theory
Decision theory Minimizing the misclassification rate Leave out.
Decision theory Minimizing the expected loss Leave out.
Decision theory The reject option Classification errors arise in regions where the largest of the posterior probabilities p(C_k|x) is significantly less than unity, or equivalently where the joint distributions p(x, C_k) have comparable values. In such regions we can avoid making decisions by rejecting inputs for which the largest posterior probability is below some threshold θ.
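A minimal sketch of the reject option in code, assuming the posterior probabilities p(C_k|x) are already available as an array (the function name and threshold are illustrative, not from the slides):

```python
import numpy as np

def classify_with_reject(posteriors, theta=0.8):
    """posteriors: shape (n_samples, n_classes).
    Returns the class index, or -1 to signal rejection."""
    best = posteriors.argmax(axis=1)
    confident = posteriors.max(axis=1) >= theta
    return np.where(confident, best, -1)
```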
Decision theory Inference & decision Three approaches to solving decision problems Bayesian approach - generative models Model the posterior probability directly - discriminative models Learn a discriminant function that maps inputs directly to class labels - probabilities play no role
Information theory Outline of this section 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality 6 Decision theory 7 Information theory Key concepts Relate the key concepts to PR
Information theory Key concepts Entropy
H[x] = -\sum_x p(x) \log_2 p(x) (19)
Kullback-Leibler divergence (KL divergence)
KL(p\|q) = -\int p(x) \ln q(x) \, dx - \left(-\int p(x) \ln p(x) \, dx\right) = -\int p(x) \ln\left\{\frac{q(x)}{p(x)}\right\} dx (20)
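Discrete versions of (19) and (20) for probability vectors p and q (illustrative helpers, not from the slides):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                           # take 0 log 0 = 0
    return -np.sum(p * np.log2(p))         # equation (19), in bits

def kl_divergence(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))   # equation (20), in nats

print(entropy(np.array([0.5, 0.25, 0.25])))              # 1.5 bits
print(kl_divergence(np.array([0.5, 0.5]), np.array([0.9, 0.1])))
```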
Information theory Relate the key concepts to PR p(x) - an unknown distribution q(x|θ) - a parametric distribution used to approximate p(x) x_n - for n = 1, 2, ..., N, drawn from p(x), so that
KL(p\|q) \simeq \frac{1}{N} \sum_{n=1}^{N} \{-\ln q(x_n|θ) + \ln p(x_n)\} (21)
Since the second term is independent of θ, minimizing this KL divergence is equivalent to maximizing the likelihood function.
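A Monte Carlo illustration of (21): with samples x_n ~ p, the θ-dependent part of KL(p‖q) is the negative average log-likelihood, so minimizing it over θ is maximum likelihood. The toy setup below (p a standard Gaussian, q a unit-variance Gaussian with unknown mean) is an assumption for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)   # x_n drawn from p

def neg_avg_loglik(theta):
    # -(1/N) sum_n ln q(x_n | theta) for q = N(x | theta, 1)
    return np.mean(0.5 * (x - theta) ** 2 + 0.5 * np.log(2 * np.pi))

thetas = np.linspace(-1, 1, 201)
best = thetas[np.argmin([neg_avg_loglik(th) for th in thetas])]
print(best)   # close to 0, the maximum-likelihood mean
```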
Thank you!