CS 6140: Machine Learning Spring 2016


CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu

Logistics Assignment 1 Due Feb 4 Electronic copy on Blackboard Hard copy in class If you have discussed a problem with someone or got the idea from other sources (e.g., academic publications, lectures, textbooks), you need to acknowledge it! Northeastern University Academic Integrity Policy http://www.northeastern.edu/osccr/academicintegrity-policy/

Survey What do you expect to learn from this course? Content of the course Difficulty of the material Difficulty of the assignments Amount of programming

What We Learned Last Week Generative Model and Discriminative Model Logistic Regression Generative Models Generative Models vs. Discriminative Models Decision Tree

Generative VS. Discriminative Model Generative model Learn P(X, Y) from training sample P(X, Y) = P(Y) P(X | Y) Specifies how to generate the observed features x for y Discriminative model Learn P(Y | X) from training sample Directly models the mapping from features x to y

Generative VS. Discriminative Model Easy to fit the model

Generative VS. Discriminative Model Easy to fit the model Generative model!

Generative VS. Discriminative Model Fit classes separately

Generative VS. Discriminative Model Fit classes separately Generative model!

Generative VS. Discriminative Model Handle missing features easily

Generative VS. Discriminative Model Handle missing features easily Generative model!

Generative VS. Discriminative Model Handle unlabeled training data

Generative VS. Discriminative Model Handle unlabeled training data Easier for Generative model!

Generative VS. Discriminative Model Symmetric in inputs and outputs

Generative VS. Discriminative Model Symmetric in inputs and outputs Generative model! Define p(x, y)

Generative VS. Discriminative Model Handle feature preprocessing

Generative VS. Discriminative Model Handle feature preprocessing Discriminative model!

Generative VS. Discriminative Model Well-calibrated probabilities

Generative VS. Discriminative Model Well-calibrated probabilities Discriminative model!

Logistic Regression A discriminative model; sigm is the sigmoid function

Logistic Regression
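As a small illustration (not taken from the slides), here is a minimal sketch of the logistic regression prediction rule p(y = 1 | x) = sigm(w·x + b); the weight vector w and bias b below are made-up values standing in for learned parameters.

```python
import numpy as np

def sigmoid(a):
    # sigm(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w, b):
    # Discriminative model: directly gives p(y = 1 | x) = sigm(w^T x + b)
    return sigmoid(X @ w + b)

# Toy usage with hypothetical learned parameters
X = np.array([[0.5, 1.2], [-1.0, 0.3]])
w = np.array([0.8, -0.4])
b = 0.1
print(predict_proba(X, w, b))
```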

Bayesian Inference

Bayes Rule
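For reference, the rule this slide refers to, written out in its standard form (the slide's own rendering is not reproduced here):

```latex
P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
\qquad\text{i.e.}\qquad
\text{posterior} \;\propto\; \text{likelihood} \times \text{prior}.
```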

Play tennis? Decision Tree

Entropy Entropy H(X) of a random variable X: H(X) = -sum_x P(X = x) log2 P(X = x). H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code)

Information Gain Gain(S, A) = expected reduction in entropy due to sorting on A
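A minimal sketch of how these two quantities could be computed for a discrete label column; the "play tennis" data below is a made-up toy split, not the slide's actual example.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # H(X) = -sum_x P(x) log2 P(x)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    # Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [y for y, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Toy example: does "outlook" help predict "play tennis"?
play    = ['yes', 'yes', 'no', 'no', 'yes', 'no']
outlook = ['sun', 'sun', 'rain', 'rain', 'overcast', 'sun']
print(entropy(play), information_gain(play, outlook))
```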

Today's Outline Bayesian Statistics Frequentist Statistics Feature Selection Some slides are borrowed from Kevin Murphy's lectures

Fundamental principle of Bayesian statistics Everything that is uncertain (parameters, hyper-parameters) is modeled with a probability distribution. Everything that is known is incorporated by conditioning on it, using Bayes rule to update our prior beliefs into posterior beliefs.

Fundamental principle of Bayesian statistics Everything that is uncertain (parameters, hyper-parameters) is modeled with a probability distribution. Everything that is known is incorporated by conditioning on it, using Bayes rule to update our prior beliefs into posterior beliefs. Posterior ∝ Likelihood × Prior

Advantages of Bayes Conceptually simple Handle small sample sizes Handle complex hierarchical models without overfitting No need to choose between different estimators, hypothesis testing procedures

Disadvantages of Bayes Need to specify a prior! Computational Issues!

Disadvantages of Bayes Need to specify a prior! Subjective. But every model comes with its own assumptions. Estimate the prior from data -> empirical Bayes

Disadvantages of Bayes Computational Issues! Computing the normalization constant requires integrating over all the parameters Computing posterior expectations requires integrating over all the parameters

Approximate inference We can evaluate posterior expectations using Monte Carlo integration

Monte Carlo Approximation In general, computing the distribution of a function of a random variable using the change-of-variables formula is difficult. A powerful alternative: generate samples from the distribution, then use Monte Carlo to approximate the expected value of any function of the random variable.

Monte Carlo Approximation Many useful functions that we can approximate

Monte Carlo Approximation Suppose x is drawn from some distribution p(x) and y = x². We can approximate p(y) by drawing samples from p(x), squaring them, and computing the empirical distribution.

Monte Carlo Approximation (figure: the empirical distribution of the squared samples approximates p(y))
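A minimal sketch of the idea, assuming for concreteness that x is uniform on [-1, 1] (an assumption; the slide's exact p(x) is not shown) and y = x²:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from p(x), push them through f(x) = x^2
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2

# The empirical distribution of y approximates p(y);
# any expectation is approximated by a sample average.
print("E[y] approx:", y.mean())  # true value is 1/3 for Unif(-1, 1)
hist, edges = np.histogram(y, bins=20, density=True)  # crude estimate of p(y)
```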

Disadvantages of Bayes Computational Issues! Computing the normalization constant requires integrating over all the parameters Computing posterior expectations requires integrating over all the parameters

Conjugate priors For simplicity, we will mostly focus on a special kind of prior which has nice mathematical properties. A prior p(θ) is said to be conjugate to a likelihood p(D | θ) if the corresponding posterior p(θ | D) has the same functional form as p(θ).

Conjugate priors This means the prior family is closed under Bayesian updating: we can recursively apply the rule to update our beliefs as data streams in -> online learning

Coin Tossing Example Consider the problem of estimating the probability of heads from a sequence of N coin tosses: Likelihood, Prior, Posterior

Likelihood: Binomial distribution Let X = number of heads in N trials; then X ~ Bin(N, θ), with P(X = k | θ, N) = C(N, k) θ^k (1 - θ)^(N - k)

Likelihood: Bernoulli Distribution Special case of the Binomial: the Binomial distribution with N = 1 is called the Bernoulli distribution. Specifically, P(X = x | θ) = θ^x (1 - θ)^(1 - x) for x in {0, 1}

Fitting a Bernoulli distribution Suppose we conduct N = 100 trials and get data D = (1, 0, 1, 1, 0, ...) with N1 heads and N0 tails. What is θ?

Fitting a Bernoulli distribution Suppose we conduct N = 100 trials and get data D = (1, 0, 1, 1, 0, ...) with N1 heads and N0 tails. What is θ? Maximum likelihood estimation

Fitting a Bernoulli distribution

Fitting a Bernoulli distribution Log-likelihood

Fitting a Bernoulli distribution Log-likelihood

Fitting a Bernoulli distribution
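A small sketch of the maximum-likelihood fit described above; the data vector here is made up for illustration.

```python
import numpy as np

def bernoulli_log_likelihood(theta, data):
    # log p(D | theta) = N1 log(theta) + N0 log(1 - theta)
    n1 = np.sum(data)
    n0 = len(data) - n1
    return n1 * np.log(theta) + n0 * np.log(1.0 - theta)

data = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # toy coin flips
n1, n0 = data.sum(), len(data) - data.sum()

# Setting the derivative of the log-likelihood to zero gives theta_MLE = N1 / (N1 + N0)
theta_mle = n1 / (n1 + n0)
print(theta_mle, bernoulli_log_likelihood(theta_mle, data))
```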

Conjugate priors: The Beta-Bernoulli model Consider the probability of heads θ, given a sequence of N coin tosses, X1, ..., XN. Likelihood: Bernoulli. The natural conjugate prior is the Beta distribution. The posterior is also Beta, with updated counts.

The Beta distribution Beta(θ | a, b) = θ^(a-1) (1 - θ)^(b-1) / B(a, b), where B(a, b) is the Beta function (the normalization constant)

The Beta distribution

Updating a beta distribution Prior is Beta(2,2). Observe 1 head. Posterior is Beta(3,2), so the mean shifts from 2/4 to 3/5. Prior is Beta(3,2). Observe 1 head. Posterior is Beta(4,2), so the mean shifts from 3/5 to 4/6.
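A short sketch that reproduces the update on this slide: start from Beta(a, b), add each observed head to a and each tail to b, and the posterior mean a / (a + b) shifts accordingly.

```python
def beta_update(a, b, heads, tails):
    # Beta(a, b) prior + Bernoulli likelihood -> Beta(a + heads, b + tails) posterior
    return a + heads, b + tails

def beta_mean(a, b):
    return a / (a + b)

a, b = 2, 2                      # prior Beta(2,2), mean 2/4
a, b = beta_update(a, b, 1, 0)   # observe one head -> Beta(3,2), mean 3/5
print(beta_mean(a, b))
a, b = beta_update(a, b, 1, 0)   # observe another head -> Beta(4,2), mean 4/6
print(beta_mean(a, b))
```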

Setting the hyper-parameters The prior hyper-parameters can be interpreted as pseudo counts The effective sample size (strength) of the prior is a + b The prior mean is a / (a + b) If our prior belief is p(heads) = 0.3, and we think this belief is equivalent to about 10 data points, we just solve a / (a + b) = 0.3 and a + b = 10

Point Estimation The posterior is our belief state. To convert it to a single best guess (point estimate), we pick the value that minimizes some loss function, e.g., MSE -> posterior mean, 0/1 loss -> posterior mode

Posterior Mean Let N = N1 + N0 be the amount of data, and a + b be the amount of virtual data. The posterior mean is a convex combination of the prior mean a / (a + b) and the MLE N1 / N

MAP Estimation It is often easier to compute the posterior mode (optimization) than the posterior mean (integration). This is called maximum a posteriori estimation. For the beta distribution:
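For reference, the standard Beta posterior summaries (standard results, stated here rather than read off the slide): with posterior Beta(θ | a + N1, b + N0),

```latex
\mathbb{E}[\theta \mid D] = \frac{a + N_1}{a + b + N},
\qquad
\hat{\theta}_{\mathrm{MAP}} = \frac{a + N_1 - 1}{a + b + N - 2},
\qquad N = N_1 + N_0 .
```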

Summary of the Beta-Bernoulli model

Bayesian Model Selection Faced with a set of models of different complexity, how should we choose?

Bayesian Model Selection Cross-validation Divide the training set into N partitions Train on N-1 partitions, and evaluate on the rest In total, fitting the model N times

Bayesian Model Selection Compute the posterior Then compute the MAP

Bayesian Model Selection Compute the posterior With a uniform prior over models, we are picking the model which maximizes the marginal likelihood (also called the integrated likelihood, or evidence)

Bayes Factors To compare two models, use the posterior odds: posterior odds = Bayes factor × prior odds. The Bayes factor is a Bayesian version of a likelihood ratio test that can be used to compare models of different complexity

Example: Coin Flipping Suppose we toss a coin N = 250 times and observe N1 = 141 heads and N0 = 109 tails

Example: Coin Flipping Suppose we toss a coin N = 250 times and observe N1 = 141 heads and N0 = 109 tails Consider two hypotheses: H0: θ = 0.5 (the coin is fair) H1: θ is unknown, with a prior over [0, 1]

Example: Coin Flipping
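A sketch of how the Bayes factor for this example could be computed, assuming H0 fixes θ = 0.5 and H1 places a uniform Beta(1, 1) prior on θ (the uniform prior is an assumption; the slide does not show its choice). The common binomial coefficient is dropped from both marginal likelihoods since it cancels in the ratio.

```python
import numpy as np
from scipy.special import betaln

N1, N0 = 141, 109
N = N1 + N0

# H0: theta = 0.5 exactly
log_p_D_H0 = N * np.log(0.5)

# H1: theta ~ Beta(a, b) with a = b = 1 (uniform prior, an assumption)
a, b = 1.0, 1.0
# Marginal likelihood of the Beta-Bernoulli model: B(a + N1, b + N0) / B(a, b)
log_p_D_H1 = betaln(a + N1, b + N0) - betaln(a, b)

log_BF_10 = log_p_D_H1 - log_p_D_H0
print("log10 Bayes factor (H1 vs H0):", log_BF_10 / np.log(10))
```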

Bayesian Occam's Razor Occam's Razor

Bayesian Occam's Razor Occam's Razor Simplest model that adequately explains the data

Bayesian Occam's Razor Occam's Razor: simplest model that adequately explains the data. Selecting models using MLE or MAP parameter estimates would always favor the model with the most parameters. Instead, integrate out the parameters!

Bayesian Occam's Razor Overfitting early samples

Bayesian Occam's Razor Probability over all possible datasets Complex models must spread out their probability mass thinly

Bayesian Occam's Razor Complex models must spread out their probability mass thinly

Marginal likelihood When performing Bayesian model selection and empirical Bayes estimation, we will need the marginal likelihood p(D). This is given by a ratio of the posterior and prior normalizing constants; for the Beta-Bernoulli model, p(D) = B(a + N1, b + N0) / B(a, b)

Summary of the Beta-Bernoulli model

From coins to dice

Multinomial: 1 sample One-hot (1-of-K) encoding Probability for class k

Likelihood

Conjugate Prior: Dirichlet distribution Generalization of the Beta to K dimensions Normalization constant

Conjugate Prior: Dirichlet distribution Generalization of the Beta to K dimensions (figure: Dirichlet densities for parameters (20, 20, 20), (2, 2, 2), (20, 2, 2))

Summary of the Dirichlet-multinomial model
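A minimal sketch of the Dirichlet-multinomial update, paralleling the Beta-Bernoulli case; the prior and counts below are illustrative values, not the slide's.

```python
import numpy as np

# Dirichlet(alpha) prior over K = 3 outcomes (e.g., a three-sided die); illustrative values
alpha = np.array([2.0, 2.0, 2.0])

# Observed counts N_k for each outcome
counts = np.array([10, 3, 7])

# Conjugacy: posterior is Dirichlet(alpha + counts)
alpha_post = alpha + counts

posterior_mean = alpha_post / alpha_post.sum()  # (alpha_k + N_k) / (sum_j alpha_j + N)
mle = counts / counts.sum()                     # N_k / N
print(posterior_mean, mle)
```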

Frequentist Statistics We have seen how Bayesian inference offers a principled solution to the parameter estimation problem.

Frequentist Statistics Parameter estimation MAP estimate MLE

Why maximum likelihood? The KL divergence from the true distribution p to the approximation q is KL(p || q) = sum_x p(x) log [ p(x) / q(x) ]

Why maximum likelihood? The KL divergence from the true distribution p to the approximation q, with p replaced by the empirical distribution p_emp(x) = (1/N) sum_i delta(x - x_i)

Maximum Likelihood = min KL (to the empirical distribution) KL divergence to the empirical distribution

Maximum Likelihood = min KL (to the empirical distribution) KL divergence to the empirical distribution Hence minimizing KL is equivalent to minimizing the average negative log likelihood on the training set
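The step being summarized here, written out under the usual definitions (the empirical distribution p_emp places mass 1/N on each training point x_i):

```latex
\mathrm{KL}\big(p_{\mathrm{emp}} \,\|\, q\big)
= \sum_x p_{\mathrm{emp}}(x) \log \frac{p_{\mathrm{emp}}(x)}{q(x)}
= -H(p_{\mathrm{emp}}) \;-\; \frac{1}{N}\sum_{i=1}^{N} \log q(x_i)
```

Since the entropy term does not depend on q, minimizing the KL over q is the same as minimizing the average negative log-likelihood, i.e., maximizing the likelihood.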

Bernoulli MLE Remember that the MLE is θ = N1 / (N1 + N0)

However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0.

However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0. Too few samples -> sparse data!

However Suppose we toss a coin N = 3 times and see 3 tails. We would estimate the probability of heads as 0. We can add pseudo counts C0 and C1 (e.g., 0.1) to the sufficient statistics N0 and N1 to get a better behaved estimate. This is the MAP estimate using a Beta prior.
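A tiny sketch of the pseudo-count fix for the 3-tails example, using C0 = C1 = 0.1 as suggested above; adding pseudo counts this way corresponds to a MAP estimate under a Beta(C1 + 1, C0 + 1) prior.

```python
N1, N0 = 0, 3          # three tosses, all tails
C1, C0 = 0.1, 0.1      # pseudo counts

theta_mle = N1 / (N1 + N0)                   # = 0: degenerate estimate
theta_map = (N1 + C1) / (N1 + C1 + N0 + C0)  # MAP under a Beta(C1 + 1, C0 + 1) prior
print(theta_mle, theta_map)                  # 0.0 vs ~0.031
```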

MLE for the multinomial If x_n ∈ {1, ..., K}, the likelihood is p(D | θ) = prod_k θ_k^(N_k). The log-likelihood is sum_k N_k log θ_k

Computing the multinomial MLE

Computing the multinomial MLE
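The constrained maximization these slides work through, in its standard form: maximize the log-likelihood subject to the probabilities summing to one, via a Lagrange multiplier.

```latex
\max_{\theta}\;\; \ell(\theta) = \sum_{k=1}^{K} N_k \log \theta_k
\quad \text{s.t.} \quad \sum_{k} \theta_k = 1
\;\;\Rightarrow\;\;
\frac{\partial}{\partial \theta_k}\Big[\ell(\theta) + \lambda\big(1 - \textstyle\sum_k \theta_k\big)\Big]
= \frac{N_k}{\theta_k} - \lambda = 0
\;\;\Rightarrow\;\;
\hat{\theta}_k = \frac{N_k}{N}
```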

Computing the Gaussian MLE

Computing the Gaussian MLE
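For reference, the standard closed-form result this derivation leads to (maximizing the Gaussian log-likelihood in μ and σ²):

```latex
\hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_{n=1}^{N} x_n,
\qquad
\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{N}\sum_{n=1}^{N} \big(x_n - \hat{\mu}_{\mathrm{MLE}}\big)^2 .
```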

Bayesian vs. Frequentist MLE returns a point estimate. In frequentist statistics, we treat D as random and θ as fixed, and ask how the estimate would change if D changed. In Bayesian statistics, we treat D as fixed and θ as random, and model our uncertainty with the posterior.

Unbiased estimators The bias of an estimator is defined as the difference between its expected value and the true parameter: bias = E[estimate] - true value. An estimator is unbiased if bias = 0.

Unbiased estimators The MLE for the Gaussian mean is unbiased

Is being unbiased enough?

Consistent estimators An estimator is consistent if it converges (in probability) to the true value with enough data. The MLE is a consistent estimator.

Bias-variance tradeoff Being unbiased is not necessarily desirable! Suppose our loss function is mean squared error, MSE = E[(estimate - true value)²], where the expectation is over the sampling distribution of the data.
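The decomposition behind the tradeoff, in standard notation (θ* the true value, θ̂ the estimator, expectations over the sampling distribution of the data):

```latex
\mathbb{E}\big[(\hat{\theta} - \theta^*)^2\big]
= \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta^*\big)^2}_{\text{bias}^2}
\;+\;
\underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\text{variance}}
```

So a slightly biased estimator can still win on MSE if its variance is much smaller.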

Feature Selection If predictive accuracy is the goal, often best to keep all predictors and use L2 regularization We often want to select a subset of the inputs that are most relevant for predicting the output, to get sparse models: interpretability, speed, possibly better predictive accuracy

Filter methods Compute the relevance of each feature to the label marginally Computationally efficient

Correlation coefficient Measures the extent to which X_j and Y are linearly related

Correlation coefficient Mutual information Can model non-linear, non-Gaussian dependencies For discrete data: I(X; Y) = sum_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
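A sketch of both filter scores on toy discrete data (the arrays below are invented for illustration); the mutual information is computed directly from the joint and marginal frequencies following the definition above.

```python
import numpy as np

def correlation(x, y):
    # Pearson correlation coefficient: cov(x, y) / (std(x) std(y))
    return np.corrcoef(x, y)[0, 1]

def mutual_information(x, y):
    # I(X; Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ], for discrete data
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

# Toy data: feature x is informative about label y, feature z is noise
x = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y = np.array([0, 0, 1, 1, 1, 0, 0, 0])
z = np.array([1, 0, 1, 0, 1, 0, 1, 0])
print(correlation(x, y), mutual_information(x, y), mutual_information(z, y))
```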

Wrapper Methods Perform a discrete search in model space Wrap the search around standard model fitting

Wrapper Methods Forward selection for linear regression At each step, add the feature that maximally reduces the residual error

Wrapper Methods Forward selection for linear regression At each step, add the feature that maximally reduces the residual error

Wrapper Methods Forward selection for linear regression Put the estimation in
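A minimal sketch of greedy forward selection for linear regression: at each step, fit least squares on the current subset plus each candidate feature and keep the one that reduces the residual sum of squares the most. Assumptions here (not from the slides): plain np.linalg.lstsq fits, a fixed number of steps, no regularization, and synthetic toy data.

```python
import numpy as np

def rss(X, y):
    # Residual sum of squares of the least-squares fit on the given columns
    w, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ w
    return r @ r

def forward_selection(X, y, n_select):
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        # Add the feature that maximally reduces residual error
        best = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: y depends on columns 0 and 2 only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=100)
print(forward_selection(X, y, 2))   # expected to pick features 0 and 2
```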

What we learned today Bayesian Statistics Frequentist Statistics Feature Selection Some slides are borrowed from Kevin Murphy's lectures

Homework Read Murphy Ch. 5, 6 Assignment 1 due 02/04, 6pm! Both hard copy and electronic copy