Generative Models for Discrete Data

Similar documents
Probabilistic modeling. The slides are closely adapted from Subhransu Maji's slides

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Bayesian Models in Machine Learning

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

Last Time. Today. Bayesian Learning. The Distributions We Love. CSE 446 Gaussian Naïve Bayes & Logistic Regression

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Bayesian Methods: Naïve Bayes

Lecture 4. Generative Models for Discrete Data - Part 3. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza.

Collaborative Filtering. Radek Pelánek

Introduction to AI Learning Bayesian networks. Vibhav Gogate

Scaling Neighbourhood Methods

Be able to define the following terms and answer basic questions about them:

Andriy Mnih and Ruslan Salakhutdinov

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Machine Learning

Data Mining Techniques

Gaussian Models

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Bayesian Learning (II)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Machine Learning, Fall 2012 Homework 2

The Bayes classifier

Decoupled Collaborative Ranking

Machine Learning

MLE/MAP + Naïve Bayes

Nonparametric Bayesian Methods (Gaussian Processes)

CSC321 Lecture 18: Learning Probabilistic Models

CS 361: Probability & Statistics

MLE/MAP + Naïve Bayes

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

* Matrix Factorization and Recommendation Systems

Probability Review and Naïve Bayes

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

Generative Clustering, Topic Modeling, & Bayesian Inference

Naïve Bayes. Vibhav Gogate The University of Texas at Dallas

Bayesian Methods for Machine Learning

CS-E3210 Machine Learning: Basic Principles

Bayesian Classifiers, Conditional Independence and Naïve Bayes. Required reading: Naïve Bayes and Logistic Regression (available on class website)

Learning Bayesian network : Given structure and completely observed data

Modeling Environment

Computational Cognitive Science

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation

Behavioral Data Mining. Lecture 2

Matrix Factorization Techniques for Recommender Systems

CS 361: Probability & Statistics

Naïve Bayes. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

Learning Bayesian belief networks

Introduction to Machine Learning CMU-10701

CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Probabilistic and Bayesian Machine Learning

The Naïve Bayes Classifier. Machine Learning Fall 2017

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up

Ad Placement Strategies

Introduction to Machine Learning Midterm Exam Solutions

Algorithms for Collaborative Filtering

CS540 Machine learning L8

DEPARTMENT OF COMPUTER SCIENCE AUTUMN SEMESTER MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

Pattern Recognition and Machine Learning

Machine Learning: Assignment 1

Computational Cognitive Science

HOMEWORK #4: LOGISTIC REGRESSION

Introduction to Probabilistic Machine Learning

CS 188: Artificial Intelligence Spring Today

Machine Learning for Signal Processing Bayes Classification

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Naïve Bayes classification

Introduction to Machine Learning Midterm Exam

CS 6375 Machine Learning

Estimating Parameters

Data Mining 2018 Logistic Regression Text Classification

CHAPTER 2 Estimating Probabilities

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008

Data Science Mastery Program

Naïve Bayes Classifiers and Logistic Regression. Doug Downey Northwestern EECS 349 Winter 2014

Click Prediction and Preference Ranking of RSS Feeds

COMS 4721: Machine Learning for Data Science Lecture 1, 1/17/2017

Study Notes on the Latent Dirichlet Allocation

Logistic Regression. Machine Learning Fall 2018

Bayesian Linear Regression [DRAFT - In Progress]

Collaborative Filtering Matrix Completion Alternating Least Squares

Machine Learning. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 8 May 2012

Recommender Systems: Overview and. Package rectools. Norm Matloff. Dept. of Computer Science. University of California at Davis.

Introduction to Machine Learning. Regression. Computer Science, Tel-Aviv University,

Large-scale Ordinal Collaborative Filtering

Bayesian Analysis for Natural Language Processing Lecture 2

Introduction to Machine Learning

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Be able to define the following terms and answer basic questions about them:

Introduction to Gaussian Process

Bayesian Approach 2. CSC412 Probabilistic Learning & Reasoning

Transcription:

Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21

Agenda: Bayesian Concept Learning; Beta-Binomial Model; Dirichlet-Multinomial Model; Naïve Bayes Classifiers

Bayesian Concept Learning Numbers Game: given positive examples of a class, predict membership for the remaining numbers [responses from 8 people]

Bayesian Concept Learning Likelihood Example: after 1 example vs after 4 examples. Occam's razor: simpler is better [likelihood ratio is almost 5000:1]

Bayesian Concept Learning Estimation: MAP vs MLE. Posterior = Likelihood * Prior / Evidence. Maximum A Posteriori (MAP) vs Maximum Likelihood Estimate (MLE)
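To make the MAP/MLE distinction concrete, here is a minimal Python sketch for a numbers-game-style concept set. The concept list, uniform prior, and size-principle likelihood below are illustrative assumptions rather than the lecture's exact setup; with a uniform prior the MAP and MLE concepts coincide, and a non-uniform prior is what separates them.

```python
# A minimal sketch of MAP vs MLE for a numbers-game-style setup. The concepts,
# the uniform prior, and the "size principle" likelihood p(D | h) = (1/|h|)^N
# are illustrative assumptions.
concepts = {
    "even":           set(range(2, 101, 2)),
    "odd":            set(range(1, 101, 2)),
    "powers_of_2":    {2 ** k for k in range(1, 7)},   # 2 .. 64
    "powers_of_4":    {4 ** k for k in range(1, 4)},   # 4, 16, 64
    "multiples_of_8": set(range(8, 101, 8)),
}
prior = {h: 1.0 / len(concepts) for h in concepts}      # uniform prior (assumption)
data = [16]                                             # one positive example

def likelihood(h, examples):
    members = concepts[h]
    if not all(x in members for x in examples):
        return 0.0
    return (1.0 / len(members)) ** len(examples)

# Posterior = Likelihood * Prior / Evidence
unnormalized = {h: likelihood(h, data) * prior[h] for h in concepts}
evidence = sum(unnormalized.values())
posterior = {h: p / evidence for h, p in unnormalized.items()}

mle_concept = max(concepts, key=lambda h: likelihood(h, data))  # ignores the prior
map_concept = max(posterior, key=posterior.get)                 # likelihood * prior
print("MLE:", mle_concept, "MAP:", map_concept)  # with a uniform prior they coincide
```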

Bayesian Concept Learning Posterior: posterior probabilities for 32 different concepts after observing 1 value; "powers of 4" it is

Bayesian Concept Learning Posterior: posterior probabilities for 32 different concepts after observing 4 values; "powers of 2" it is

Bayesian Concept Learning After 1 example: Bayesian Model Averaging [BMA] (a posterior-weighted average of the concepts' predictions)

Bayesian Concept Learning More Complex Prior (to mimic humans): a mixture prior over concepts

Beta-Binomial Distribution Likelihood: same form whether we observe the counts or a sequence of trials. Prior: Beta(a, b). Posterior: Beta(a + N1, b + N0). Conjugate prior: the prior and posterior have the same form.
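A minimal sketch of the conjugate update, assuming a Beta(a, b) prior and N1 heads / N0 tails; the values mirror the coin-flip example later in this deck, and scipy is used only for convenience.

```python
# Beta-Binomial conjugacy: a Beta(a, b) prior combined with a Bernoulli/Binomial
# likelihood gives a Beta(a + N1, b + N0) posterior. The hyperparameters a, b act
# as pseudo counts (the "equivalent sample size" of the prior).
from scipy.stats import beta

a, b = 2, 2            # prior hyperparameters (pseudo counts)
N1, N0 = 3, 17         # observed heads / tails

prior = beta(a, b)
posterior = beta(a + N1, b + N0)     # same family as the prior: conjugacy

print("prior mean:     ", prior.mean())      # a / (a + b) = 0.5
print("posterior mean: ", posterior.mean())  # (a + N1) / (a + b + N1 + N0) = 5/24
```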

Beta-Binomial Distribution Effect of the Prior: the hyperparameters, also known as (aka) pseudo counts, give the equivalent sample size of the prior

Beta-Binomial Distribution Overfitting and the Black Swan Problem: "We've never seen a black swan, so it's not possible." Remedy: Laplace's rule of succession [aka add-one smoothing]
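A tiny hedged illustration of the rule of succession; the swan counts below are made up.

```python
# Laplace's rule of succession (add-one smoothing): even after seeing zero black
# swans in N sightings, the smoothed estimate never collapses to exactly zero.
N_black, N_white = 0, 1000                                  # made-up counts
p_black_mle     = N_black / (N_black + N_white)             # 0.0 -> "not possible"
p_black_laplace = (N_black + 1) / (N_black + N_white + 2)   # ~0.000998, small but nonzero
print(p_black_mle, p_black_laplace)
```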

Beta-Binomial Distribution Use the Force [Prior]: after observing 3 heads out of 20 coin flips, the posterior predictive distribution based on a Beta(2, 2) prior has a heavier tail, while the simpler plug-in approximation exhibits a bit more certainty
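A sketch of the heavier-tail comparison, assuming scipy's betabinom distribution is available; the 10 future flips are an illustrative choice, while the Beta(2, 2) prior and 3-of-20 counts come from the slide.

```python
# Posterior predictive vs plug-in for 3 heads out of 20 flips with a Beta(2, 2)
# prior: the predictive (Beta-Binomial) integrates over posterior uncertainty,
# while the plug-in (Binomial at the MLE) is more concentrated.
from scipy.stats import betabinom, binom

a_post, b_post = 2 + 3, 2 + 17        # posterior Beta(5, 19)
M = 10                                # number of future flips (illustrative)

predictive = betabinom(M, a_post, b_post)   # integrates over theta ~ Beta(5, 19)
plugin     = binom(M, 3 / 20)               # plugs in the MLE theta_hat = 0.15

print(predictive.pmf(8), plugin.pmf(8))     # predictive puts far more mass in the tail
```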

Dirichlet-Multinomial Distribution Likelihood: multinomial counts. Prior: Dirichlet. Posterior: Dirichlet (conjugate, as in the Beta-Binomial case).
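A minimal sketch of the Dirichlet-Multinomial update, assuming K = 3 categories; the prior and counts below are made-up values.

```python
# Dirichlet-Multinomial conjugacy: a Dirichlet(alpha) prior combined with
# multinomial counts N_k gives a Dirichlet(alpha + N) posterior.
import numpy as np

alpha  = np.array([1.0, 1.0, 1.0])     # symmetric prior (add-one style pseudo counts)
counts = np.array([5, 0, 2])           # observed category counts

alpha_post = alpha + counts            # posterior parameters
theta_hat  = alpha_post / alpha_post.sum()   # posterior mean of the category probabilities
print(theta_hat)                       # [0.6 0.1 0.3] -- the zero-count category stays nonzero
```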

Dirichlet-Multinomial Distribution Bag of Words Representation. Warning: consistency issue: count = 3 vs count = 4 for unk
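A small sketch of a bag-of-words count vector with an unk bucket; the vocabulary and sentence are made up, not the textbook's example.

```python
# Bag of words: a document becomes a vector of word counts over a fixed
# vocabulary, with out-of-vocabulary words folded into a single "unk" token.
from collections import Counter

vocab = ["the", "force", "swan", "unk"]
doc = "may the force be with the black swan".split()

counts = Counter(w if w in vocab else "unk" for w in doc)
bow = [counts[w] for w in vocab]
print(bow)   # [2, 1, 1, 4] -- "may", "be", "with", "black" all count as unk
```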

Naïve Bayes Classifiers Naïve Bayes Classifier (NBC): features are assumed to be conditionally independent given the class label. Generative model: multivariate Bernoulli naïve Bayes.

Naïve Bayes Classifiers Naïve Bayes Parameter Estimation
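A minimal sketch of multivariate Bernoulli naïve Bayes parameter estimation with add-one smoothing, assuming a binary document-term matrix X and labels y; the tiny dataset is made up.

```python
# Parameter estimation for Bernoulli naive Bayes: class priors pi_c and smoothed
# per-class word-presence probabilities theta[c, j] = p(word j present | class c).
import numpy as np

X = np.array([[1, 1, 0],    # row = document, column = "does word j appear?"
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
pi = np.array([(y == c).mean() for c in classes])    # class priors
theta = np.array([(X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
                  for c in classes])                 # add-one (Beta(1, 1)) smoothing

print(pi)       # [0.5 0.5]
print(theta)    # [[0.75 0.5 0.25], [0.25 0.5 0.75]]
```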

Naïve Bayes Classifiers Generative Model Examples: a term that spikes for both classes is not so useful for classification purposes

Naïve Bayes Classifiers Log-Sum-Exp Trick Computing the product of a bunch of conditionally independent probabilities can result in numerical underflow [everything looks like zero], so we compute the sum of log values instead. If we're just predicting the class, we only need to compute the numerator; but if we need to estimate the probability, we'll need to compute the denominator as well.
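A minimal sketch of the trick itself; the log joint values are made-up stand-ins for log p(x, y = c).

```python
# Log-sum-exp: compute log(sum_c exp(a_c)) stably by factoring out the maximum,
# then recover the posterior p(y = c | x) without exponentiating tiny numbers.
import numpy as np

def log_sum_exp(a):
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

log_joint = np.array([-1000.0, -1001.0, -1005.0])   # log p(x, y = c) for each class
# np.log(np.exp(log_joint).sum()) would underflow to -inf here.

log_evidence  = log_sum_exp(log_joint)              # the denominator, log p(x)
log_posterior = log_joint - log_evidence            # log p(y = c | x)
print(np.exp(log_posterior))                        # approx. [0.727 0.268 0.005]
```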

Naïve Bayes Classifiers Naïve Bayes Prediction

Naïve Bayes Classifiers Feature Selection: Filter Method. Class 1 = X Windows versus class 2 = MS Windows. Smoothed probability estimate: (450 + 1) / ((450 + 1) + (0 + 1)) = 451 / 452 = 0.998

Naïve Bayes Classifiers Mutual Information Example:
(7 / 904) * log2( (7 / 904) / ((114 / 902) * (451 / 902)) ) +
(108 / 904) * log2( (108 / 904) / ((114 / 902) * (451 / 902)) ) +
(445 / 904) * log2( (445 / 904) / ((788 / 902) * (451 / 902)) ) +
(344 / 904) * log2( (344 / 904) / ((788 / 902) * (451 / 902)) ) = 0.095
Note: ((6 + 1) / ((6 + 1) + (444 + 1))) * ((450 + 1) / ((450 + 1) + (450 + 1))) = (7 / 904)
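The same arithmetic in code, using the smoothed probabilities exactly as they appear on the slide (including the 904 vs 902 denominators from the consistency issue noted earlier).

```python
# Mutual information between the term and the class label, as a sum over the
# four cells of the 2x2 table: sum p(x, y) * log2( p(x, y) / (p(x) * p(y)) ).
from math import log2

terms = [
    (  7 / 904, (114 / 902) * (451 / 902)),
    (108 / 904, (114 / 902) * (451 / 902)),
    (445 / 904, (788 / 902) * (451 / 902)),
    (344 / 904, (788 / 902) * (451 / 902)),
]
mi = sum(p_joint * log2(p_joint / p_indep) for p_joint, p_indep in terms)
print(round(mi, 3))   # 0.095
```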

Recommendation Systems

Approaches
Collaborative Filtering: use a ratings matrix to impute missing ratings and make recommendations. Ratings may be either explicit (e.g. 5 stars) or implicit (e.g. purchase). Popular method: Alternating Least Squares (ALS). Cold start problem: new users and new items won't have ratings.
Content Filtering: use features of users (e.g. age, gender, location) and items (e.g. release date, genre, leading actress for a movie) to make recommendations. Supervised learning methods apply. Example: the ICDM 2013 competition to personalize Expedia hotel search results used gradient boosting for the Normalized Discounted Cumulative Gain (NDCG) loss function. If ratings are available, a collaborative filtering prediction can be used as a feature [stacking: treating the output of one model as an input for another].

Collaborative Filtering: Nearest Neighbor. User-based model: assumes similar users will have similar ratings for the same item. Item-based model: assumes similar items will have similar ratings from the same user.

Nearest Neighbor Rating Example: users are indexed by u and v; s measures similarity between users; r represents a rating for item i. We can replace each rating with (rating minus that user's mean rating) to adjust for differences in mean rating between users.
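A minimal sketch of user-based nearest-neighbour prediction with the mean-rating adjustment, assuming mean-centred cosine similarity over co-rated items; the ratings dictionary below is made up.

```python
# Predict r(u, i) as the user's mean rating plus a similarity-weighted average of
# other users' mean-adjusted ratings for item i:
#   r_hat(u, i) = rbar_u + sum_v s(u, v) * (r(v, i) - rbar_v) / sum_v |s(u, v)|
import numpy as np

r = {                                   # r[user][item] = rating
    "u1": {"i1": 5, "i2": 3},
    "u2": {"i1": 4, "i2": 2, "i3": 4},
    "u3": {"i1": 1, "i3": 2},
}

def mean_rating(u):
    return float(np.mean(list(r[u].values())))

def similarity(u, v):
    common = sorted(set(r[u]) & set(r[v]))          # co-rated items
    if not common:
        return 0.0
    a = np.array([r[u][i] for i in common]) - mean_rating(u)
    b = np.array([r[v][i] for i in common]) - mean_rating(v)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict(u, i):
    num = den = 0.0
    for v in r:
        if v != u and i in r[v]:
            s = similarity(u, v)
            num += s * (r[v][i] - mean_rating(v))
            den += abs(s)
    return mean_rating(u) + (num / den if den > 0 else 0.0)

print(round(predict("u1", "i3"), 2))    # u1 never rated i3; neighbours fill it in
```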

Alternating Least Squares Matrix Factorization: UserLatentFactorMatrix * ItemLatentFactorMatrix = RatingMatrix; [m users × f factors] * [f factors × n items] = [m users × n items]

Alternating Least Squares Algorithm
1. Initialize the items matrix with random entries.
2. Repeat until convergence (no change) or max iterations reached:
a) Fix the items matrix, and solve for regression coefficients for each row of the users matrix [a model for each user].
b) Fix the users matrix, and solve for regression coefficients for each column of the items matrix [a model for each item].
The regression coefficients form the latent (unobserved) factors for users and items.
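A NumPy sketch of the alternating updates, assuming a small dense ratings matrix with NaN for missing entries and a fixed L2 penalty; this is an illustrative toy rather than a production implementation.

```python
# Alternating least squares: holding one factor matrix fixed, each row of the
# other is a ridge-regression solve over that user's (or item's) observed ratings.
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5.0,    3.0,    np.nan],
              [4.0,    np.nan, 4.0],
              [1.0,    1.0,    np.nan],
              [np.nan, 1.0,    5.0]])
m, n = R.shape
f, lam, iters = 2, 0.1, 20                 # latent factors, regularization, iterations

U = rng.normal(size=(m, f))                # user latent factor matrix [m x f]
V = rng.normal(size=(n, f))                # item latent factor matrix [n x f] (random init)
observed = ~np.isnan(R)

for _ in range(iters):
    for u in range(m):                     # fix items, solve for each user's factors
        idx = observed[u]
        A = V[idx].T @ V[idx] + lam * np.eye(f)
        U[u] = np.linalg.solve(A, V[idx].T @ R[u, idx])
    for i in range(n):                     # fix users, solve for each item's factors
        idx = observed[:, i]
        A = U[idx].T @ U[idx] + lam * np.eye(f)
        V[i] = np.linalg.solve(A, U[idx].T @ R[idx, i])

print(np.round(U @ V.T, 2))                # reconstructed ratings, missing cells imputed
```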

References: Matrix Factorization Techniques for Recommender Systems [from part of the winning team for the $1 million Netflix prize]; Section 27.6.2 of our textbook addresses Probabilistic Matrix Factorization for collaborative filtering.

Apache Spark Apache Spark is a data processing platform based on the concept of the Resilient Distributed Dataset (RDD). Unlike Hadoop MapReduce, Spark can persist an RDD in memory across iterations. Python API: http://spark.apache.org/docs/latest/api/python/index.html We'll be using the ALS class of the recommendation module within the pyspark.mllib package: pyspark.mllib.recommendation.ALS
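A hedged usage sketch of the mllib ALS API (roughly the Spark 1.6/2.x interface from the slide's era), assuming a local Spark installation; the rating triples and hyperparameters are made up.

```python
# Train an explicit-feedback ALS model on (user, item, rating) triples and query it.
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local[*]", "als-demo")
ratings = sc.parallelize([
    Rating(0, 0, 5.0), Rating(0, 1, 3.0),
    Rating(1, 0, 4.0), Rating(1, 2, 4.0),
    Rating(2, 1, 1.0), Rating(2, 2, 5.0),
])

model = ALS.train(ratings, rank=2, iterations=10, lambda_=0.01)

print(model.predict(0, 2))               # predicted rating for user 0, item 2
print(model.recommendProducts(0, 2))     # top-2 item recommendations for user 0
sc.stop()
```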

Strong versus Weak Scalability
Strong scalability: given 10 times as many computers, we should ideally be able to generate a model using the same data in 1/10th as much time [compared to using 1 computer: time shrinks]. Example: random forests.
Weak scalability: given 10 times as many computers, we should ideally be able to generate a model using 10 times as much data in the same amount of time [compared to using 1 computer: data grows (big data)]. Example: gradient descent.