Probabilistic modeling. The slides are closely adapted from Subhransu Maji's slides.

Overview. So far, the models and algorithms you have learned about are relatively disconnected; the probabilistic modeling framework unites them, and learning can be viewed as statistical inference. There are two kinds of data models, generative and conditional, and two kinds of probability models, parametric and non-parametric.

Classification by density estimation. Assume the data is generated according to a distribution D. If we had access to D, classification would be simple: predict the label that is most probable under D. This is the Bayes optimal classifier, which achieves the smallest expected loss among all classifiers. Unfortunately, we do not have access to the distribution.
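
The formulas on this slide are not reproduced in the transcription; a minimal sketch of them, assuming D is the joint distribution over inputs x and labels y and ℓ is the loss function:

```latex
% Bayes optimal classifier: predict the most probable label under D
f^{(\mathrm{BO})}(\hat{x}) \;=\; \arg\max_{\hat{y}} \; \mathcal{D}(\hat{x}, \hat{y})

% expected loss of a predictor f
\epsilon(f) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\,\ell\big(y, f(x)\big)\,\big]
```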

Classification by density estimation. This suggests that one way to learn a classifier is to estimate D from the training data, for example by fitting a parametric distribution. We will assume that each point is generated independently from D, so a new point does not depend on previous points, and we estimate the parameters of the distribution. This is commonly referred to as the i.i.d. (independently and identically distributed) assumption.

Statistical estimation. Coin toss: we observe the sequence {H, T, H, H}. Let θ be the probability of H. What value of θ best explains the observed data? The maximum likelihood (MLE) principle: pick the parameters of the distribution that maximize the likelihood of the observed data. For i.i.d. data the likelihood is the product of the per-toss probabilities, and we maximize it with respect to θ.

Log-likelihood. It is convenient to maximize the logarithm of the likelihood instead. Maximizing the log-likelihood is equivalent to maximizing the likelihood, because log is a concave, monotonically increasing function; in addition, products become sums and the computation is numerically more stable.

Log-likelihood. Write down the log-likelihood of observing H heads and T tails, and maximize it with respect to θ.
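
A sketch of the missing derivation, using θ for the probability of heads as above:

```latex
% Likelihood and log-likelihood of H heads and T tails
p(\mathcal{D}\mid\theta) = \theta^{H}(1-\theta)^{T},
\qquad
\log p(\mathcal{D}\mid\theta) = H\log\theta + T\log(1-\theta)

% Setting the derivative to zero gives the maximum likelihood estimate
\frac{\partial}{\partial\theta}\log p(\mathcal{D}\mid\theta)
  = \frac{H}{\theta} - \frac{T}{1-\theta} = 0
\quad\Longrightarrow\quad
\hat{\theta}_{\mathrm{MLE}} = \frac{H}{H+T}
```

For the observed sequence {H, T, H, H} this gives θ̂ = 3/4.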

Rolling a die. Suppose you are rolling a k-sided die with parameters θ1, ..., θk, and you observe the number of times each side comes up. Write down the log-likelihood of the data and maximize it by setting the derivative to zero. Because the parameters must form a valid probability distribution, we need the additional constraint that they sum to one.

Lagrange multipliers. The constrained optimization problem (maximize the log-likelihood subject to the parameters summing to one) can be turned into an unconstrained optimization of the Lagrangian; at optimality, the derivatives with respect to the parameters and the multiplier are zero.
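
A sketch of the die-rolling derivation, writing n_j for the number of times side j is observed and N for the total number of rolls (notation mine):

```latex
\max_{\theta}\; \sum_{j=1}^{k} n_j \log\theta_j
\quad\text{subject to}\quad \sum_{j=1}^{k}\theta_j = 1

% Lagrangian with multiplier \lambda
\mathcal{L}(\theta,\lambda) = \sum_{j=1}^{k} n_j\log\theta_j
  + \lambda\Big(1 - \sum_{j=1}^{k}\theta_j\Big)

% Setting derivatives to zero: n_j/\theta_j = \lambda, so \theta_j \propto n_j;
% the constraint then gives \lambda = N and
\hat{\theta}_j = \frac{n_j}{N}
```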

Naive Bayes. Consider the binary prediction problem, and let the data be distributed according to a joint probability distribution over the label and the features. We can factor this distribution using the chain rule of probability. The Naive Bayes assumption is that the features are conditionally independent given the label; e.g., the words free and money are independent given spam.

Naive Bayes. Under the Naive Bayes assumption, the joint probability distribution simplifies to the product of the label prior and the per-feature class-conditional distributions, a much simpler distribution. At this point we can start parametrizing the distribution.
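
The factorization referred to above, written out as a sketch for D features x_1, ..., x_D and label y:

```latex
% Chain rule, then conditional independence of features given the label
p(y, x_1,\dots,x_D) = p(y)\, p(x_1,\dots,x_D \mid y)
  \;\overset{\text{NB}}{=}\; p(y)\prod_{d=1}^{D} p(x_d \mid y)
```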

Naive Bayes: a simple case. With binary labels and D binary features, the model has 1 + 2D parameters: one for the probability of the label being +1, D for the feature probabilities given label +1, and D for the feature probabilities given label -1.

Naive Bayes: parameter estimation. Given data, we can estimate the parameters by maximizing the data likelihood. As in the coin toss example, the maximum likelihood estimates are simple fractions: the fraction of the data with label +1, the fraction of instances with feature value 1 among the +1 examples, and the fraction of instances with feature value 1 among the -1 examples. Other cases: nominal features can be modeled with a multinomial distribution (like rolling a die), and continuous features with a Gaussian distribution; the choice of distribution is an inductive bias.

Naive Bayes: prediction. To make predictions, compute the label posterior using Bayes' rule and return the most probable label (the Bayes optimal prediction). For binary labels we can equivalently compute the likelihood ratio, or the log-likelihood ratio, and compare it to a threshold.
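
A minimal Python sketch of the binary-label, binary-feature Naive Bayes estimator and predictor described above; the function and variable names, and the small eps used as a numerical guard, are my own choices rather than anything from the slides:

```python
import numpy as np

def train_naive_bayes(X, y):
    """MLE for binary-label (+1/-1), binary-feature Naive Bayes.
    X: (N, D) array of 0/1 features, y: (N,) array of +1/-1 labels."""
    pos, neg = X[y == +1], X[y == -1]
    prior_pos = len(pos) / len(X)          # fraction of data with label +1
    theta_pos = pos.mean(axis=0)           # P(x_d = 1 | y = +1), per feature
    theta_neg = neg.mean(axis=0)           # P(x_d = 1 | y = -1), per feature
    return prior_pos, theta_pos, theta_neg

def predict_naive_bayes(x, prior_pos, theta_pos, theta_neg, eps=1e-12):
    """Return +1 if the log-likelihood ratio is positive, else -1."""
    def log_joint(prior, theta):
        # log p(y) + sum_d log p(x_d | y) for binary features x_d
        return (np.log(prior + eps)
                + np.sum(x * np.log(theta + eps)
                         + (1 - x) * np.log(1 - theta + eps)))
    llr = log_joint(prior_pos, theta_pos) - log_joint(1 - prior_pos, theta_neg)
    return +1 if llr > 0 else -1

# Tiny usage example with made-up data
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([+1, +1, -1, -1])
params = train_naive_bayes(X, y)
print(predict_naive_bayes(np.array([1, 0, 1]), *params))   # expected: +1
```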

Naive Bayes: decision boundary. The Naive Bayes classifier has a linear decision boundary!
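
To see why, here is a sketch for binary features, writing θ_{d|+1} = p(x_d = 1 | y = +1) and θ_{d|-1} = p(x_d = 1 | y = -1) (notation mine):

```latex
\log\frac{p(+1 \mid x)}{p(-1 \mid x)}
 = \log\frac{p(+1)}{p(-1)}
 + \sum_{d=1}^{D}\Big[
     x_d \log\frac{\theta_{d\mid +1}}{\theta_{d\mid -1}}
   + (1-x_d)\log\frac{1-\theta_{d\mid +1}}{1-\theta_{d\mid -1}}
   \Big]
 \;=\; w^{\top}x + b
```

The log-ratio is linear in x, so the decision boundary (where it equals zero) is a hyperplane.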

Generative and conditional models. Generative models model the joint distribution p(x, y) and use Bayes' rule to compute the label posterior; this requires simplifying assumptions (e.g., Naive Bayes). In most cases we are given x and are only interested in the label y. Conditional models model the distribution p(y|x) directly, which saves some modeling effort and allows a simpler parametrization of p(y|x). Most of the ML we have done so far aimed directly at predicting y from x.

Conditional models: regression. Assume that y has a linear relationship with x. The generative story of the dataset: for i = 1 to N, compute the linear prediction from x_i, draw Gaussian noise, and add the noise to obtain y_i. Writing out the log-likelihood of the dataset shows that maximizing the log-likelihood is equivalent to minimizing the squared error.
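
A sketch of the Gaussian-noise model behind this slide, assuming weights w and noise variance σ² (symbols mine):

```latex
y_i = w^{\top}x_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,\sigma^2)
\;\;\Longleftrightarrow\;\; y_i \sim \mathcal{N}\big(w^{\top}x_i,\; \sigma^2\big)

\log p(\mathcal{D}\mid w)
 = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - w^{\top}x_i\big)^2 + \text{const}

% so maximizing the log-likelihood in w minimizes the squared error
```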

Conditional models: classification. The sigmoid function σ(z) = 1/(1 + e^(-z)) maps the real line to (0, 1) and satisfies σ(-z) = 1 - σ(z) and dσ/dz = σ(z)(1 - σ(z)). The generative story of the data: for i = 1 to N, compute the linear score from x_i, pass it through the sigmoid to obtain the probability of the positive label, and draw the label y_i accordingly.

Conditional models: classification. The log-likelihood of the dataset (ignoring constants) shows that maximizing the log-likelihood is equivalent to minimizing the logistic loss. This model is called logistic regression.
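
A sketch of the corresponding log-likelihood, assuming labels y_i in {-1, +1} and σ(z) = 1/(1 + e^(-z)):

```latex
p(y_i \mid x_i, w) = \sigma\big(y_i\, w^{\top}x_i\big),
\qquad
\log p(\mathcal{D}\mid w) = \sum_{i=1}^{N} \log\sigma\big(y_i\, w^{\top}x_i\big)
 = -\sum_{i=1}^{N} \log\big(1 + e^{-y_i w^{\top}x_i}\big)

% so maximizing the log-likelihood minimizes the logistic loss
```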

Regularization with priors. Coin toss: for the observed sequence {H, H, H, H}, maximum likelihood estimation (MLE) gives θ = 1, i.e., the coin always lands heads, which overfits. Maximum a-posteriori (MAP) estimation instead maximizes the posterior, which is proportional to the prior times the likelihood (divided by the data probability); in log space, MAP estimation maximizes the log-prior plus the log-likelihood.
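
The MLE and MAP objectives referred to above, written out as a sketch:

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\; p(\mathcal{D}\mid\theta),
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; p(\theta\mid\mathcal{D})
 = \arg\max_{\theta}\; \frac{p(\theta)\,p(\mathcal{D}\mid\theta)}{p(\mathcal{D})}
 = \arg\max_{\theta}\; \big[\log p(\theta) + \log p(\mathcal{D}\mid\theta)\big]
```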

Regularization with priors: coin toss. Use a Beta distribution as a prior on θ. The posterior over θ given the prior and H heads and T tails is again a Beta distribution.
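
A sketch of the posterior update, assuming a Beta(a, b) prior on θ:

```latex
p(\theta\mid\mathcal{D})
 \;\propto\; \underbrace{\theta^{a-1}(1-\theta)^{b-1}}_{\text{prior}}
 \cdot \underbrace{\theta^{H}(1-\theta)^{T}}_{\text{likelihood}}
 \;=\; \theta^{a+H-1}(1-\theta)^{b+T-1}
 \;\propto\; \mathrm{Beta}(a+H,\; b+T)

% the MAP estimate is the mode of this Beta posterior
\hat{\theta}_{\mathrm{MAP}} = \frac{H + a - 1}{H + T + a + b - 2}
```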

Conjugate priors. If the prior and the posterior are in the same family, the prior is said to be conjugate to the likelihood (posterior ∝ prior × likelihood). Beta is conjugate to Bernoulli: with prior Beta(a, b) and data consisting of H heads and T tails, the posterior is Beta(a+H, b+T); the prior can be interpreted as a pseudo-count of a-1 heads and b-1 tails. Dirichlet is conjugate to the multinomial: with prior Dir(a_1, ..., a_n) and data {k_1, k_2, ..., k_n} giving the number of occurrences of each side, the posterior is Dir(a_1+k_1, ..., a_n+k_n); the prior can be interpreted as a pseudo-count of a_i - 1 for the i-th side.

Regularization with priors: regression. Assume that y has a linear relationship with x, with the same generative story as before (for i = 1 to N, compute the linear prediction from x_i and add Gaussian noise). Assume in addition a Gaussian prior on the weights. The MAP estimate of w is then the same as l2-regularized least-squares regression; a Laplace prior gives l1 regularization instead.
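
A sketch of the MAP objective for regression, assuming a Gaussian prior w ~ N(0, τ²I) and noise variance σ² (symbols mine):

```latex
\hat{w}_{\mathrm{MAP}}
 = \arg\max_{w}\;\big[\log p(\mathcal{D}\mid w) + \log p(w)\big]
 = \arg\min_{w}\;\frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - w^{\top}x_i\big)^2
   + \frac{1}{2\tau^2}\,\lVert w\rVert_2^2

% i.e., l2-regularized least squares; a Laplace prior replaces the squared
% l2 norm with \lVert w\rVert_1, giving l1 regularization
```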

Non-parametric density models. So far we assumed that the probability distribution was parametric (Gaussian distribution, binomial distribution, etc.), which allowed us to estimate the data distribution by estimating the parameters of the probability distribution. However, the data distribution can be complicated; for example, it might have multiple modes. Non-parametric density models offer a flexible alternative.

Density estimation using histograms. The histogram is the simplest example of a non-parametric density model. The bin size is a hyperparameter of the model: too large a bin over-smooths the estimate, too small a bin makes it noisy.

Kernel density estimation. Histograms can be seen as sums of delta functions centered at each data point; kernel density estimation replaces each delta with a kernel function K whose width is controlled by the hyperparameter b. These density estimators are also called Parzen window estimators. The hyperparameter is set by cross-validation: the MLE estimate is b = 0, which is clearly wrong (overfitting).
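
A minimal Python sketch of a Gaussian-kernel density estimator of the form p̂(x) = (1/(N·b)) Σ_i K((x − x_i)/b); the function names and the one-dimensional toy data are illustrative assumptions, not code from the course:

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(query, data, b):
    """Kernel density estimate at each query point.
    query: (M,) points to evaluate, data: (N,) observed samples, b: bandwidth."""
    # Average kernels centered at each data point, scaled by the bandwidth
    u = (query[:, None] - data[None, :]) / b          # (M, N) scaled distances
    return gaussian_kernel(u).mean(axis=1) / b

# Tiny usage example: a bimodal sample, evaluated on a grid
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
grid = np.linspace(-4, 4, 9)
print(kde(grid, data, b=0.5))   # density estimate at each grid point
```

As the slide notes, b would be chosen by cross-validation on held-out data; letting b shrink toward 0 piles all the density onto the training points and overfits.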

Kernel density estimation: example with a rectangular kernel (figure).

Kernel density estimation: example with a Gaussian kernel (figure).

Kernel density classifier. Estimate p(x|+1) and p(x|-1) separately and compute the likelihood ratio p(+1)p(x|+1) / p(-1)p(x|-1); predict class +1 if the ratio is greater than 1. The kernel width again controls the smoothness of the decision boundary (small width vs. large width figures). The kNN classifier is a kernel density classifier whose kernel width at a point equals the distance to the k-th nearest neighbor. Figure from Duda et al.

Summary. Probabilistic modeling views learning as statistical inference. There are two ways to estimate the parameters of the distribution: maximum likelihood, which maximizes p(D|θ), and maximum a-posteriori, which maximizes p(θ|D). There are two kinds of data models: generative models of p(y, x) (e.g., Naive Bayes, kernel density classifiers) and conditional models of p(y|x) (e.g., linear and logistic regression). There are two kinds of probability models: parametric (Gaussian, Bernoulli, etc.), learned by MLE or MAP, and non-parametric (kernel density estimators), with hyperparameters set by cross-validation.

Slide credits. The slides are heavily adapted from Subhransu Maji's course. The logistic and linear regression figures are from Wikipedia, as is the figure of the Beta distribution. The kernel density estimation figures are from http://www.mglerner.com/blog/?p=28 (the page has an interactive demo). The Parzen window figure is from Pattern Classification by Duda, Hart & Stork. Some slides are based on the CIML book by Hal Daumé III.