Statistical learning
Chapter 20, Sections 1-4

Outline
- Bayesian learning
- Maximum a posteriori and maximum likelihood learning
- Bayes net learning: ML parameter learning with complete data; linear regression
- Expectation-Maximization (EM) algorithm
- Instance-based learning

Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the hypothesis space.
H is the hypothesis variable, with values h_1, h_2, ... and prior P(H).
The jth observation d_j gives the outcome of the random variable D_j; the training data are d = d_1, ..., d_N.
Given the data so far, each hypothesis has a posterior probability
    P(h_i | d) = α P(d | h_i) P(h_i)
where P(d | h_i) is called the likelihood.
Predictions use a likelihood-weighted average over the hypotheses:
    P(X | d) = Σ_i P(X | d, h_i) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)
No need to pick one best-guess hypothesis!

Example
Suppose there are five kinds of bags of candies:
- 10% are h_1: 100% cherry candies
- 20% are h_2: 75% cherry candies + 25% lime candies
- 40% are h_3: 50% cherry candies + 50% lime candies
- 20% are h_4: 25% cherry candies + 75% lime candies
- 10% are h_5: 100% lime candies
Then we observe candies drawn from some bag:
What kind of bag is it? What flavour will the next candy be?

Posterior probability of hypotheses
[Figure: P(h_1 | d), ..., P(h_5 | d) plotted against the number of samples in d, from 0 to 10.]

Prediction probability
[Figure: P(next candy is lime | d) plotted against the number of samples in d, from 0 to 10.]
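The two curves above can be reproduced with a short Bayesian-updating loop. A minimal Python sketch, assuming for illustration that every observed candy turns out to be lime; the bag priors and per-bag lime probabilities come from the example above:

```python
# Minimal sketch: Bayesian updating for the five-bag candy example.
# Assumes, hypothetically, that every observed candy is lime.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h_1), ..., P(h_5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)

posterior = priors[:]
for n in range(10):                   # observe 10 lime candies in a row
    unnorm = [p * q for p, q in zip(posterior, p_lime)]  # P(d_j | h_i) P(h_i | d so far)
    total = sum(unnorm)
    posterior = [u / total for u in unnorm]
    # Prediction: P(next is lime | d) = sum_i P(lime | h_i) P(h_i | d)
    pred = sum(p * q for p, q in zip(posterior, p_lime))
    print(n + 1, [round(p, 3) for p in posterior], round(pred, 3))
```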

MAP approximation
Summing over the hypothesis space is often intractable (e.g., there are 18,446,744,073,709,551,616 Boolean functions of 6 attributes).
Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_i | d),
i.e., maximize P(d | h_i) P(h_i), or equivalently log P(d | h_i) + log P(h_i).
The log terms can be viewed as (the negative of) the bits to encode the data given the hypothesis plus the bits to encode the hypothesis.
This is the basic idea of minimum description length (MDL) learning.
For deterministic hypotheses, P(d | h_i) is 1 if consistent, 0 otherwise,
so MAP = simplest consistent hypothesis (cf. science).

ML approximation
For large data sets, the prior becomes irrelevant.
Maximum likelihood (ML) learning: choose h_ML maximizing P(d | h_i),
i.e., simply get the best fit to the data; identical to MAP for a uniform prior
(which is reasonable if all hypotheses are of the same complexity).
ML is the standard (non-Bayesian) statistical learning method.
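To make the contrast concrete, here is a minimal sketch (the observation of two lime candies in a row is hypothetical; priors and lime probabilities are taken from the five-bag example above) showing that ML ignores the prior while MAP weights the fit by it:

```python
# MAP vs ML hypothesis choice after observing `limes` lime candies in a row.
# Priors and per-bag lime probabilities come from the five-bag example above.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
limes = 2                                   # hypothetical observation count

likelihood = [q ** limes for q in p_lime]   # P(d | h_i)
h_ml  = max(range(5), key=lambda i: likelihood[i])              # best fit to the data
h_map = max(range(5), key=lambda i: likelihood[i] * priors[i])  # fit weighted by the prior
print("ML: h_%d  MAP: h_%d" % (h_ml + 1, h_map + 1))            # ML: h_5  MAP: h_4
```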

ML parameter learning in Bayes nets
Bag from a new manufacturer; what is the fraction θ of cherry candies?
Any θ is possible: a continuum of hypotheses h_θ.
θ is a parameter for this simple (binomial) family of models.
[Bayes net: a single node Flavor, with P(F = cherry) = θ.]
Suppose we unwrap N candies, c cherries and l = N - c limes.
These are i.i.d. (independent, identically distributed) observations, so
    P(d | h_θ) = Π_{j=1}^{N} P(d_j | h_θ) = θ^c (1 - θ)^l
Maximize this w.r.t. θ, which is easier for the log-likelihood:
    L(d | h_θ) = log P(d | h_θ) = Σ_{j=1}^{N} log P(d_j | h_θ) = c log θ + l log(1 - θ)
    dL(d | h_θ)/dθ = c/θ - l/(1 - θ) = 0   ⇒   θ = c/(c + l) = c/N
Seems sensible, but causes problems with 0 counts!
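As a quick check of the formula, and of the zero-count warning, here is a minimal sketch (the candy counts are hypothetical):

```python
# Minimal sketch: binomial ML estimate theta = c / N, and the zero-count problem.
def ml_theta(c, l):
    return c / (c + l)

print(ml_theta(3, 7))   # 0.3: fraction of cherries among 10 unwrapped candies
print(ml_theta(0, 10))  # 0.0: no cherries seen yet, so cherry is declared impossible
```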

Multiple parameters
Red/green wrapper depends probabilistically on flavor:
[Bayes net: Flavor → Wrapper, with P(F = cherry) = θ, P(W = red | F = cherry) = θ_1, P(W = red | F = lime) = θ_2.]
Likelihood for, e.g., a cherry candy in a green wrapper:
    P(F = cherry, W = green | h_{θ,θ_1,θ_2})
      = P(F = cherry | h_{θ,θ_1,θ_2}) P(W = green | F = cherry, h_{θ,θ_1,θ_2})
      = θ (1 - θ_1)
For N candies, with r_c red-wrapped cherry candies, etc.:
    P(d | h_{θ,θ_1,θ_2}) = θ^c (1 - θ)^l · θ_1^{r_c} (1 - θ_1)^{g_c} · θ_2^{r_l} (1 - θ_2)^{g_l}
    L = [c log θ + l log(1 - θ)] + [r_c log θ_1 + g_c log(1 - θ_1)] + [r_l log θ_2 + g_l log(1 - θ_2)]

Multiple parameters contd.
Derivatives of L contain only the relevant parameter:
    ∂L/∂θ   = c/θ - l/(1 - θ) = 0          ⇒  θ   = c/(c + l)
    ∂L/∂θ_1 = r_c/θ_1 - g_c/(1 - θ_1) = 0  ⇒  θ_1 = r_c/(r_c + g_c)
    ∂L/∂θ_2 = r_l/θ_2 - g_l/(1 - θ_2) = 0  ⇒  θ_2 = r_l/(r_l + g_l)
With complete data, parameters can be learned separately.
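With complete data the three estimates really are just separate ratios of counts; a minimal sketch with hypothetical counts:

```python
# Minimal sketch: ML parameters for the Flavor -> Wrapper net from complete data.
# The counts below are hypothetical, chosen only to show that each parameter
# is a simple ratio computed independently of the others.
c, l = 60, 40            # cherry / lime candies (N = 100)
r_c, g_c = 45, 15        # red / green wrappers among the cherries
r_l, g_l = 10, 30        # red / green wrappers among the limes

theta   = c   / (c + l)      # P(F = cherry)
theta_1 = r_c / (r_c + g_c)  # P(W = red | F = cherry)
theta_2 = r_l / (r_l + g_l)  # P(W = red | F = lime)
print(theta, theta_1, theta_2)   # 0.6 0.75 0.25
```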

Example: linear Gaussian model
[Figure: the conditional density P(y | x) over the unit square, and a scatter of (x, y) data points with the fitted line.]
Maximizing
    P(y | x) = (1 / (√(2π) σ)) e^{-(y - (θ_1 x + θ_2))² / (2σ²)}
w.r.t. θ_1, θ_2 is the same as minimizing
    E = Σ_{j=1}^{N} (y_j - (θ_1 x_j + θ_2))²
That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance.
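A minimal least-squares sketch (the data points are hypothetical, scattered around a line), showing that the ML fit under this model is just the ordinary sum-of-squared-errors minimizer:

```python
# Minimal sketch: least-squares line fit = ML estimate under fixed-variance Gaussian noise.
# Hypothetical data points scattered roughly around y = 0.5 x + 0.2.
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [0.21, 0.28, 0.42, 0.49, 0.63, 0.68]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Closed-form minimizer of E = sum_j (y_j - (theta_1 x_j + theta_2))^2
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
den = sum((x - mean_x) ** 2 for x in xs)
theta_1 = num / den
theta_2 = mean_y - theta_1 * mean_x
print(round(theta_1, 3), round(theta_2, 3))   # roughly 0.5 and 0.2
```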

Expectation Maximization (EM)
When to use:
- Data is only partially observable
- Unsupervised clustering (target value unobservable)
- Supervised learning (some instance attributes unobservable)
Some uses:
- Training Bayesian belief networks
- Unsupervised clustering (AUTOCLASS)
- Learning Hidden Markov Models

Generating Data from Mixture of k Gaussians
[Figure: density p(x) along x, a mixture of Gaussian bumps.]
Each instance x is generated by:
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian
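A minimal sketch of this generative process for k = 2 (the component means and the shared standard deviation are hypothetical):

```python
import random

# Minimal sketch: draw samples from a mixture of k Gaussians with uniform
# mixing weights. Means and the common standard deviation are hypothetical.
means = [-2.0, 3.0]      # k = 2 component means
sigma = 1.0              # shared, known standard deviation

def sample_mixture(n):
    data = []
    for _ in range(n):
        mu = random.choice(means)            # step 1: pick a component uniformly
        data.append(random.gauss(mu, sigma)) # step 2: sample from that Gaussian
    return data

xs = sample_mixture(500)
```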

EM for Estimating k Means
Given:
- Instances from X generated by a mixture of k Gaussian distributions
- Unknown means μ_1, ..., μ_k of the k Gaussians
- No knowledge of which instance x_i was generated by which Gaussian
Determine: maximum likelihood estimates of μ_1, ..., μ_k
Think of the full description of each instance as y_i = ⟨x_i, z_i1, z_i2⟩, where z_ij is 1 if x_i was generated by the jth Gaussian; x_i is observable, z_ij is unobservable.

EM for Estimating k Means (contd.)
EM algorithm: pick a random initial h = ⟨μ_1, μ_2⟩, then iterate:
E step: Calculate the expected value E[z_ij] of each hidden variable z_ij, assuming the current hypothesis h = ⟨μ_1, μ_2⟩ holds:
    E[z_ij] = p(x = x_i | μ = μ_j) / Σ_{n=1}^{2} p(x = x_i | μ = μ_n)
            = e^{-(x_i - μ_j)² / (2σ²)} / Σ_{n=1}^{2} e^{-(x_i - μ_n)² / (2σ²)}
M step: Calculate a new maximum likelihood hypothesis h' = ⟨μ'_1, μ'_2⟩, assuming the value taken on by each hidden variable z_ij is its expected value E[z_ij] calculated above; then replace h = ⟨μ_1, μ_2⟩ by h' = ⟨μ'_1, μ'_2⟩:
    μ'_j ← Σ_{i=1}^{m} E[z_ij] x_i / Σ_{i=1}^{m} E[z_ij]
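A minimal sketch of this two-component EM loop, with a known shared σ and only the two means estimated; the data generation mirrors the mixture-sampling sketch above:

```python
import math
import random

# Minimal sketch: two-component EM with known shared sigma; only the means
# are estimated. The data below mirror the earlier mixture-sampling sketch.
xs = [random.gauss(random.choice([-2.0, 3.0]), 1.0) for _ in range(500)]

def em_two_means(data, sigma=1.0, iters=50):
    mu = random.sample(data, 2)          # random initial hypothesis h = <mu_1, mu_2>
    for _ in range(iters):
        # E step: expected value E[z_ij] of each hidden indicator
        resp = []
        for x in data:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M step: each new mean is the E[z_ij]-weighted average of the data
        mu = [sum(r[j] * x for r, x in zip(resp, data)) /
              sum(r[j] for r in resp) for j in range(2)]
    return mu

print(em_two_means(xs))   # should land near the true means, roughly -2 and 3 (in some order)
```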

EM Algorithm
Converges to a local maximum likelihood hypothesis h and provides estimates of the hidden variables z_ij.
In fact, it finds a local maximum of E[ln P(Y | h)], where
- Y is the complete data (observable plus unobservable variables), and
- the expected value is taken over the possible values of the unobserved variables in Y.

General EM Problem
Given:
- Observed data X = {x_1, ..., x_m}
- Unobserved data Z = {z_1, ..., z_m}
- Parameterized probability distribution P(Y | h), where Y = {y_1, ..., y_m} is the full data, y_i = x_i ∪ z_i, and h are the parameters
Determine: h that (locally) maximizes E[ln P(Y | h)]
Many uses:
- Training Bayesian belief networks
- Unsupervised clustering (e.g., k means)
- Hidden Markov Models

General EM Method
Define a likelihood function Q(h' | h) which calculates Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:
    Q(h' | h) ≡ E[ln P(Y | h') | h, X]
EM algorithm:
Estimation (E) step: Calculate Q(h' | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
    Q(h' | h) ← E[ln P(Y | h') | h, X]
Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:
    h ← argmax_{h'} Q(h' | h)

Instance-Based Learning
Key idea: just store all training examples ⟨x_i, f(x_i)⟩.
Nearest neighbor: given a query instance x_q, first locate the nearest training example x_n, then estimate f̂(x_q) ← f(x_n).
k-nearest neighbor: given x_q,
- take a vote among its k nearest neighbors (if discrete-valued target function), or
- take the mean of the f values of its k nearest neighbors (if real-valued):
    f̂(x_q) ← Σ_{i=1}^{k} f(x_i) / k
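A minimal k-nearest-neighbor sketch for a discrete-valued target function (the training examples, query point, and k are hypothetical):

```python
import math
from collections import Counter

# Minimal k-nearest-neighbor sketch for a discrete-valued target function.
# The stored training examples and the query point are hypothetical.
train = [((1.0, 1.0), 'a'), ((1.2, 0.8), 'a'), ((4.0, 4.2), 'b'),
         ((3.8, 4.0), 'b'), ((0.9, 1.3), 'a')]

def knn_classify(x_q, examples, k=3):
    # Sort the stored examples by Euclidean distance to the query x_q
    nearest = sorted(examples, key=lambda ex: math.dist(x_q, ex[0]))[:k]
    # Take a vote among the k nearest neighbors
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((1.1, 1.0), train))   # 'a'
```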

When To Consider Nearest Neighbor
- Instances map to points in R^n
- Fewer than 20 attributes per instance
- Lots of training data
Advantages:
- Training is very fast
- Can learn complex target functions
- Doesn't lose information
Disadvantages:
- Slow at query time
- Easily fooled by irrelevant attributes