
Intelligent Systems: Statistical Machine Learning
Carsten Rother, Dmitrij Schlesinger, WS2015/2016

Our model and tasks
The model: two variables are usually present. The first one is typically discrete, $k \in K$, and is called the class; the second one is often continuous, $x \in X$, and is called the observation.
The recognition (inference) task: let the joint probability distribution $p(x, k)$ be given. Observe $x$, estimate $k$.
The (statistical) learning task: given a training set $L = \big((x_1, k_1), (x_2, k_2), \ldots, (x_l, k_l)\big)$, find the corresponding probability distribution $p(x, k)$.
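
To make the two tasks concrete, here is a minimal Python sketch with a made-up discrete joint distribution $p(x, k)$ (both variables are kept discrete for simplicity; in the lecture $x$ is usually continuous):

```python
import numpy as np

# Made-up joint distribution p(x, k): rows index observations x, columns classes k
p_xk = np.array([
    [0.30, 0.05],
    [0.10, 0.15],
    [0.05, 0.35],
])
assert np.isclose(p_xk.sum(), 1.0)

# Recognition (inference): observe x, compute the posterior p(k | x)
x = 1
posterior = p_xk[x] / p_xk[x].sum()
print(posterior)          # [0.4 0.6]

# Learning would go the other way: estimate p(x, k) from samples (x_i, k_i)
```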

Outline
1. Decision making: Bayesian Decision Theory; non-Bayesian formulation
2. Statistical learning: the Maximum Likelihood principle

Bayesian Decision Theory: the idea (a game)
Somebody samples a pair $(x, k)$ according to a probability distribution $p(x, k)$. He keeps $k$ hidden and presents $x$ to you. You decide for some $k$ according to a chosen decision strategy. Somebody penalizes your decision according to a loss function, i.e. he compares your decision with the true hidden $k$. You know both $p(x, k)$ and the loss function (how he compares). Your goal is to design the decision strategy so as to pay as little as possible on average.

Bayesian Risk
The decision set $D$. Note: it need not coincide with $K$! Examples: decisions such as "I don't know", "surely not this class", etc.
Decision strategy (a mapping): $e: X \to D$
Loss function: $C: D \times K \to \mathbb{R}$
The Bayesian risk is the expected loss,
$$R(e) = \sum_{x, k} p(x, k)\, C\big(e(x), k\big) \to \min_e$$
which should be minimized with respect to the decision strategy. For a particular observation $x$:
$$R(d) = \sum_k p(k \mid x)\, C(d, k) \to \min_d$$
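
As a concrete illustration of minimizing the per-observation risk $R(d) = \sum_k p(k \mid x)\, C(d, k)$, here is a minimal Python sketch; the posterior values and the loss matrix are made-up numbers, not taken from the lecture:

```python
import numpy as np

# Hypothetical posterior p(k|x) for classes k = 0, 1, 2 at one observation x
posterior = np.array([0.5, 0.3, 0.2])

# Hypothetical loss matrix C[d, k]: rows are decisions, columns are true classes
C = np.array([
    [0.0, 1.0, 5.0],   # decision d = 0
    [1.0, 0.0, 2.0],   # decision d = 1
    [2.0, 1.0, 0.0],   # decision d = 2
])

# Expected loss R(d) = sum_k p(k|x) * C(d, k) for every decision d
risk = C @ posterior

# Bayes-optimal decision: the one with the smallest expected loss
d_star = int(np.argmin(risk))
print(risk, d_star)
```

Note that here the optimal decision is $d = 1$ even though class 0 is the most probable one: an asymmetric loss can pull the decision away from the MAP class.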

Maximum A-posteriori Decision (MAP)
The loss is the simplest one (the 0/1 or "delta" loss):
$$C(k, k') = \delta(k \neq k') = \begin{cases} 1 & \text{if } k \neq k' \\ 0 & \text{otherwise} \end{cases}$$
i.e. we pay 1 if we are wrong, no matter which error we make. From that follows
$$R(k) = \sum_{k'} p(k' \mid x)\, \delta(k \neq k') = \sum_{k'} p(k' \mid x) - p(k \mid x) = 1 - p(k \mid x) \to \min_k$$
which is equivalent to $p(k \mid x) \to \max_k$.
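
Under the 0/1 loss, the general risk-minimization rule above reduces to picking the most probable class. A quick check in Python, again with made-up posterior values:

```python
import numpy as np

posterior = np.array([0.5, 0.3, 0.2])   # hypothetical p(k|x)
K = len(posterior)

# 0/1 loss: C(d, k) = 1 if d != k, else 0
C01 = 1.0 - np.eye(K)

risk = C01 @ posterior                  # R(d) = 1 - p(d|x)
assert np.argmin(risk) == np.argmax(posterior)
```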

A MAP example
Let $K = \{1, 2\}$, $x \in \mathbb{R}^2$, and let $p(k)$ be given. The conditional probability distributions of the observations given the classes are Gaussians:
$$p(x \mid k) = \frac{1}{2\pi\sigma_k^2} \exp\left[-\frac{\|x - \mu_k\|^2}{2\sigma_k^2}\right]$$
The loss function is $C(k, k') = \delta(k \neq k')$, i.e. we want MAP. The decision strategy (the mapping $e: X \to K$) partitions the input space into two regions: one corresponding to the first class and one corresponding to the second. What does this partition look like?

A MAP example (continued)
For a particular $x$ we decide for class 1 if
$$p(1)\, \frac{1}{2\pi\sigma_1^2} \exp\left[-\frac{\|x - \mu_1\|^2}{2\sigma_1^2}\right] > p(2)\, \frac{1}{2\pi\sigma_2^2} \exp\left[-\frac{\|x - \mu_2\|^2}{2\sigma_2^2}\right]$$
Special case for simplicity: $\sigma_1 = \sigma_2$ and $p(1) = p(2)$. Then the decision strategy reduces to a threshold test (derivation on the board)
$$\langle x, \mu_2 - \mu_1 \rangle > \mathrm{const}$$
i.e. a linear classifier: the decision boundary is a hyperplane orthogonal to $\mu_2 - \mu_1$ (more examples in the exercise).
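
A minimal Python sketch of this two-Gaussian MAP rule; the means, the shared $\sigma$ and the priors are made-up numbers. With equal variances and equal priors the rule agrees with the linear test above:

```python
import numpy as np

# Hypothetical class-conditional Gaussians in R^2 with a shared sigma
mu = {1: np.array([0.0, 0.0]), 2: np.array([2.0, 1.0])}
sigma = 1.0
prior = {1: 0.5, 2: 0.5}

def log_posterior_unnorm(x, k):
    """log p(k) + log p(x|k), up to a constant shared by both classes."""
    return np.log(prior[k]) - np.sum((x - mu[k]) ** 2) / (2 * sigma ** 2)

def map_decision(x):
    return max((1, 2), key=lambda k: log_posterior_unnorm(x, k))

# Equivalent linear test <x, mu2 - mu1> > const (equal sigmas and priors)
w = mu[2] - mu[1]
const = (np.dot(mu[2], mu[2]) - np.dot(mu[1], mu[1])) / 2.0

x = np.array([1.5, 0.2])
assert map_decision(x) == (2 if np.dot(x, w) > const else 1)
```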

Decision with rejection
The decision set is $D = K \cup \{rw\}$, i.e. it is extended by a special decision "I don't know" (rejection). The loss function is
$$C(d, k) = \begin{cases} \delta(d \neq k) & \text{if } d \in K \\ \varepsilon & \text{if } d = rw \end{cases}$$
i.e. we pay a (reasonable) penalty if we are too lazy to decide. Case-by-case analysis:
1. We decide for a class $d \in K$; the best such decision is the MAP one, $d = k^* = \arg\max_k p(k \mid x)$, with expected loss $1 - p(k^* \mid x)$.
2. We decide to reject, $d = rw$; the loss is $\varepsilon$.
The decision strategy: compare $p(k^* \mid x)$ with $1 - \varepsilon$; if the posterior is greater, decide for $k^*$, otherwise reject.
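
A small Python sketch of this rejection rule; the posterior values and $\varepsilon$ are made-up:

```python
import numpy as np

def decide_with_reject(posterior, eps, reject_label="reject"):
    """Return the MAP class if its posterior exceeds 1 - eps, otherwise reject."""
    k_star = int(np.argmax(posterior))
    return k_star if posterior[k_star] > 1.0 - eps else reject_label

print(decide_with_reject(np.array([0.9, 0.1]), eps=0.2))    # confident -> class 0
print(decide_with_reject(np.array([0.55, 0.45]), eps=0.2))  # uncertain -> 'reject'
```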

Outline
1. Decision making: Bayesian Decision Theory; non-Bayesian formulation
2. Statistical learning: the Maximum Likelihood principle

Non-Bayesian Decisions
Despite the generality of the Bayesian approach, there are many tasks which cannot be expressed within the Bayesian framework:
1. It is difficult to establish a penalty function, e.g. its values do not come from a totally ordered set.
2. A priori probabilities $p(k)$ are not known, or cannot be known because $k$ is not a random event.
An example: the hero of a Russian fairy tale. When he turns to the left, he loses his horse; when he turns to the right, he loses his sword; and if he turns back, he loses his beloved girl. Is the sum of $p_1$ horses and $p_2$ swords less or more than $p_3$ beloved girls?

Example: decision while curing a patient
We have: $x \in X$, observations (features) measured on a patient; $k \in K = \{\text{healthy}, \text{seriously sick}\}$, hidden states; $d \in D = \{\text{do not cure}, \text{apply a drug}\}$, decisions.
The penalty problem: how do we assign a real number to each penalty?

                    do not cure         apply a drug
    healthy         correct decision    small health damage
    seriously sick  death possible      correct decision

Example: enemy or allied airplane?
The observation $x$ describes the observed airplane. Two hidden states: $k = 1$ (allied airplane) and $k = 2$ (enemy airplane). The conditional probability $p(x \mid k)$ may depend on the observation $x$ in a complicated manner, but it exists and correctly describes the dependence of the observation $x$ on the situation $k$. The a priori probabilities $p(k)$, however, are not known and cannot be known in principle: the hidden state $k$ is not a random event.

Neyman-Pearson Task (1928, 1933)
The strategy (a partitioning of the input space into two subsets $X_1 \cup X_2 = X$) is characterized by two numbers:
1. Probability of the false positive (false alarm): $\omega(1) = \sum_{x \in X_2} p(x \mid 1)$
2. Probability of the false negative (overlooked danger): $\omega(2) = \sum_{x \in X_1} p(x \mid 2)$
Minimize the conditional probability of the false positive subject to the condition that the false negative is bounded:
$$\omega(1) \to \min_{X_1, X_2} \quad \text{s.t.} \quad \omega(2) \leq \varepsilon$$
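
The slide only poses the constrained problem; by the Neyman-Pearson lemma its solution has the form of a likelihood-ratio test. Below is a sketch for a discrete observation space, ignoring randomization at the threshold; the class-conditional probabilities and $\varepsilon$ are made-up numbers:

```python
import numpy as np

# Hypothetical discrete observation space with class-conditional probabilities
p1 = np.array([0.4, 0.3, 0.2, 0.1])   # p(x | 1), allied
p2 = np.array([0.1, 0.1, 0.3, 0.5])   # p(x | 2), enemy
eps = 0.2                             # bound on the overlooked danger omega(2)

# Add observations to X2 ("decide enemy") in order of decreasing likelihood
# ratio p(x|2)/p(x|1) until the overlooked danger omega(2) is <= eps.
order = np.argsort(-(p2 / p1))
X2 = set()
omega2 = 1.0                          # omega(2) while X2 is still empty
for x in order:
    if omega2 <= eps:
        break
    X2.add(int(x))
    omega2 -= p2[x]

omega1 = sum(p1[x] for x in X2)       # resulting false alarm omega(1)
print(sorted(X2), omega1, omega2)
```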

Outline
1. Decision making: Bayesian Decision Theory; non-Bayesian formulation
2. Statistical learning: the Maximum Likelihood principle

Learning
Let a parameterized class (family) of probability distributions be given, i.e. $p(x; \theta) \in P$. Example: the set of Gaussians in $\mathbb{R}^n$,
$$p(x; \mu, \sigma) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[-\frac{\|x - \mu\|^2}{2\sigma^2}\right]$$
parameterized by the mean $\mu \in \mathbb{R}^n$ and the standard deviation $\sigma \in \mathbb{R}$.
Let the training data be given: $L = (x_1, x_2, \ldots, x_L)$. One has to decide for a particular probability distribution from the given family, i.e. for a particular parameter value (e.g. $\theta = (\mu, \sigma)$ for Gaussians).

Maximum Likelihood Principle
Assumption: the training dataset is a realization of the unknown probability distribution, i.e. it is sampled according to it. What is observed should have a high probability. Therefore, maximize the probability of the training data with respect to the unknown parameter:
$$p(L; \theta) \to \max_\theta$$

Example: Gaussian
We have a family of probability distributions for $x \in \mathbb{R}^n$:
$$p(x; \mu, \sigma) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left[-\frac{\|x - \mu\|^2}{2\sigma^2}\right]$$
and a training set $L = (x_1, x_2, \ldots, x_L)$.
Assumption: the samples are independent and identically distributed (i.i.d.), i.e. the probability of the training set is
$$p(L; \mu, \sigma) = \prod_l p(x_l; \mu, \sigma)$$
Take the logarithm and maximize it with respect to the unknown parameters:
$$\ln p(L; \mu, \sigma) = \sum_l \ln p(x_l; \mu, \sigma) \to \max_{\mu, \sigma}$$

Example: Gaussian (continued)
Substitute the model:
$$\sum_l \left[-\frac{\|x_l - \mu\|^2}{2\sigma^2} - n \ln\big(\sqrt{2\pi}\,\sigma\big)\right] = -\frac{1}{2\sigma^2}\sum_l \|x_l - \mu\|^2 - L\, n \ln\big(\sqrt{2\pi}\,\sigma\big) \to \max_{\mu, \sigma}$$
Assume we are interested only in the center $\mu$, i.e. $\sigma$ is given. Then the problem can be further simplified to
$$\sum_l \|x_l - \mu\|^2 \to \min_\mu$$
Take the derivative, set it to zero and solve:
$$\frac{\partial}{\partial \mu} \sum_l \|x_l - \mu\|^2 \propto \sum_l (x_l - \mu) = \sum_l x_l - L\mu = 0 \quad\Rightarrow\quad \mu = \frac{1}{L}\sum_l x_l$$
i.e. the mean value over the dataset.
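
A quick numerical check of this derivation in Python, on synthetic data with $\sigma$ treated as known, consistent with the simplification above:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
sigma = 0.5                                     # treated as known
X = rng.normal(true_mu, sigma, size=(1000, 2))  # training set, L = 1000 samples in R^2

# ML estimate of the mean: the minimizer of sum_l ||x_l - mu||^2, i.e. the sample mean
mu_ml = X.mean(axis=0)

# Sanity check: the summed squared distance is smallest at mu_ml
def sq_loss(mu):
    return np.sum((X - mu) ** 2)

assert sq_loss(mu_ml) <= sq_loss(mu_ml + 0.1)
print(mu_ml)                                    # close to true_mu for large L
```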

Some comments
The Maximum Likelihood estimator is not the only estimator; there are many others as well.
Maximum Likelihood is consistent, i.e. it yields the true parameters in the limit of infinite training sets.
Consider the following experiment for an estimator:
1. Generate an infinite number of training sets, each one being finite;
2. For each training set, estimate the parameter;
3. Average all estimated values.
If the average equals the true parameter, the estimator is called unbiased. Maximum Likelihood is not always unbiased; it depends on the parameter to be estimated. Examples: the ML estimate of the mean of a Gaussian is unbiased, that of the standard deviation is not.
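
A small simulation illustrating the last point: many small training sets are drawn and the ML estimates of the one-dimensional mean and standard deviation are averaged; the sample sizes and the seed are made-up:

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, true_sigma = 0.0, 1.0
L, num_sets = 5, 200_000                 # many small training sets

X = rng.normal(true_mu, true_sigma, size=(num_sets, L))

mu_hat = X.mean(axis=1)                  # ML estimate of the mean per training set
sigma_hat = X.std(axis=1)                # ML estimate of sigma (divides by L, not L-1)

print(mu_hat.mean())      # approximately 0.0 -> unbiased
print(sigma_hat.mean())   # noticeably below 1.0 -> biased for small L
```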

Summary
Today: statistical Machine Learning.
1. The model: two kinds of random variables, observations and hidden variables. Tasks: inference and learning.
2. Decision making: (a) Bayesian formulations: decision strategies, Bayesian risk, MAP and others; (b) non-Bayesian formulations: the Neyman-Pearson task.
3. Learning: the Maximum Likelihood principle, consistency, (un)biased estimators.