CS 6140: Machine Learning, Spring 2016: What We Learned Last Week (2/26/16)


Logistics
CS 6140: Machine Learning, Spring 2016. Instructor: Lu Wang, College of Computer and Information Science, Northeastern University. Webpage: www.ccs.neu.edu/home/luwang. Email: luwang@ccs.neu.edu. Sign up at Piazza: http://piazza.com/northeastern/spring2016/cs6140. All course-relevant questions go there! Textbook. Assignment 1: analytical questions, simple programming. Exam: open book; computer allowed, but no internet. Project proposal: problem definition, related work, potential model and algorithms, datasets, evaluation.

What We Learned Last Week
Basic concepts: supervised learning vs. unsupervised learning; parametric vs. non-parametric; classification vs. regression; training set, test set, development set; overfitting vs. underfitting. K-Nearest Neighbors. Linear Regression. Ridge Regression.

Linear Regression
Assumption: the response is a linear function of the inputs, i.e., an inner product between the input sample x and the weight vector w. Residual error: the difference between the prediction and the true label.
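As a minimal sketch of the linear-regression and ridge-regression recap above (assuming NumPy; the toy data and variable names are illustrative, not from the course): the prediction is the inner product of X and w, the residual error is the difference between the prediction and the true label, and ridge adds an L2 penalty to the least-squares objective.

```python
import numpy as np

# Toy data: n samples, d features (hypothetical example, not from the slides).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

# Ordinary least squares: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: w = (X^T X + lambda * I)^{-1} X^T y
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Prediction is an inner product; residual error = prediction - true label.
y_hat = X @ w_ridge
residuals = y_hat - y
print(w_ols, w_ridge, np.mean(residuals ** 2))
```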

Ridge Regression
We want to minimize the penalized least-squares objective: the squared error plus an L2 penalty on the weights.

Today's Outline
Generative Model and Discriminative Model. Logistic Regression. Generative Models. Generative Models vs. Discriminative Models. Decision Tree.

Generative Models vs. Discriminative Models
Generative model: learn P(X, Y) from the training sample, with P(X, Y) = P(Y) P(X | Y); this specifies how to generate the observed features x for a label y. Discriminative model: learn P(Y | X) from the training sample; this directly models the mapping from features x to label y.

Logistic Regression
A discriminative model. y is 0 or 1, and Ber is a Bernoulli distribution. This is classification, not really regression, despite the name. Remember that in linear regression the response was a linear function of the inputs.
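For reference, the standard form of the Bernoulli model that logistic regression uses (as in the Murphy textbook; written out here since the slides only name it), compared with the Gaussian model of linear regression:

```latex
% Logistic regression (discriminative, Bernoulli response):
p(y \mid x, w) = \mathrm{Ber}\!\left(y \mid \mathrm{sigm}(w^{\top} x)\right)
% Compare with linear regression (Gaussian response):
p(y \mid x, w) = \mathcal{N}\!\left(y \mid w^{\top} x, \sigma^{2}\right)
```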

Logistic Regression: Sigmoid Function
A discriminative model; sigm is the sigmoid function, defined as sigm(a) = 1 / (1 + exp(-a)).
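A minimal, numerically stable sketch of the sigmoid and the resulting class-1 probability (illustrative code, not from the course materials):

```python
import numpy as np

def sigmoid(a):
    """sigm(a) = 1 / (1 + exp(-a)), computed stably for large |a|."""
    out = np.empty_like(a, dtype=float)
    pos = a >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))
    out[~pos] = np.exp(a[~pos]) / (1.0 + np.exp(a[~pos]))
    return out

# p(y = 1 | x, w) = sigm(w^T x); predict class 1 when the probability exceeds 0.5.
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
p1 = sigmoid(np.array([w @ x]))[0]
print(p1, int(p1 > 0.5))
```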

Logistic Regression: Parameter Estimation
How do we get w? Minimize the negative log-likelihood.

Parameter Estimation: Gradient and Hessian
The gradient is g = X^T (mu - y) and the Hessian is H = X^T S X, where mu_i = sigm(w^T x_i) and S = diag(mu_i (1 - mu_i)). Unlike linear regression, the MLE has no closed-form solution, but our objective function is convex, so it has a unique global minimum.

Parameter Estimation: Gradient Descent
Example: gradient descent updates w_{k+1} = w_k - eta * g_k, where eta is the step size.
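A sketch of batch gradient descent on the logistic-regression negative log-likelihood, using the gradient form g = X^T (mu - y) given above; the toy data, step size, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30, 30)))

def nll(w, X, y):
    # Negative log-likelihood of the Bernoulli-logistic model.
    mu = sigmoid(X @ w)
    eps = 1e-12
    return -np.sum(y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps))

def gradient_descent(X, y, eta=0.01, n_iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(X @ w)
        g = X.T @ (mu - y)       # gradient of the NLL
        w = w - eta * g          # eta is the step size
    return w

# Hypothetical toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)
w_hat = gradient_descent(X, y)
print(w_hat, nll(w_hat, X, y))
```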

Changing Step Size
With a suitable step size, gradient descent is guaranteed to converge to a local optimum. Gradient descent direction: remember that we want each step to decrease the objective. Line search: find the step size by minimizing the objective along the descent direction.

Parameter Estimation: Newton's Method
In gradient descent we use only first-order (gradient) information. Newton's method is second-order optimization and gives faster optimization. Consider a second-order Taylor series approximation of the objective function at step k; rewriting and minimizing it yields the update w_{k+1} = w_k - H_k^{-1} g_k, where g_k and H_k are the gradient and Hessian at step k.
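A sketch of the Newton update for logistic regression (the iteratively reweighted least squares form); the small damping term and the toy data are my own additions for numerical stability and illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30, 30)))

def newton_logreg(X, y, n_iters=10, damp=1e-6):
    """Newton's method (second-order optimization) for logistic regression."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        mu = sigmoid(X @ w)
        g = X.T @ (mu - y)                              # gradient
        S = mu * (1.0 - mu)                             # diagonal of S_k
        H = X.T @ (X * S[:, None]) + damp * np.eye(d)   # Hessian (plus tiny damping)
        w = w - np.linalg.solve(H, g)                   # w_{k+1} = w_k - H^{-1} g
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -0.5]) + 0.3 * rng.normal(size=200) > 0).astype(float)
print(newton_logreg(X, y))
```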

Parameter Es@ma@on: Newton s Method Now apply Newton s method to our problem Parameter Es@ma@on: Newton s Method Adding L2 Regulariza@on Adding L2 Regulariza@on To avoid overfiyng Genera@ve model Learn P(X, Y) from training sample P(X, Y)=P(Y)P(X Y) Specifies how to generate the observed features x for y Discrimina@ve model Learn P(Y X) from training sample Directly models the mapping from features x to y 6

Bayesian Concept Learning
How do human beings learn from everyday life? The meanings of words, the causes of a person's action, the future outcomes of a dynamic process. (Some of the slides are borrowed from Kevin Murphy's lectures.)

Bayesian Inference: The Number Game
Observe one or more example numbers and judge whether other numbers are "yes" or "no". Ingredients: a hypothesis space H, a prior p(h), a likelihood p(D | h), and the computed posterior p(h | D). The number game illustrates generalization from positive samples.

Bayesian Model
H: the hypothesis space of possible concepts. X: n examples of a concept C. Evaluate hypotheses given the data using Bayes' rule: p(h | X) is proportional to p(X | h) p(h).

Hypothesis Space
Mathematical properties (~50 hypotheses): odd, even, square, cube, prime, multiples of small integers, powers of small integers, same first (or last) digit. Magnitude intervals (~5000 hypotheses): all intervals of integers with endpoints between 1 and 100.

Likelihood
Size principle: smaller hypotheses receive greater likelihood, and exponentially more so as n increases. This is Occam's razor: the model favors the simplest or smallest hypothesis consistent with the data. Example: D = {16}, h1 = powers of two under 100, h2 = even numbers under 100; P(D | h1) = 1/6, P(D | h2) = 1/50.

Prior
X = {60, 80, 10, 30}.
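A small sketch of the size principle on the D = {16} example above; equal priors over the two hypotheses are assumed here purely for illustration.

```python
# Size principle on the number game, for D = {16}.
powers_of_two = {2 ** k for k in range(1, 7)}          # {2, 4, ..., 64}: 6 numbers under 100
even_numbers = set(range(2, 101, 2))                   # 50 numbers under 100
D = [16]

def likelihood(D, h):
    # p(D | h) = (1 / |h|)^n if every example is in h, else 0.
    return (1.0 / len(h)) ** len(D) if all(x in h for x in D) else 0.0

prior = {"powers of two": 0.5, "even numbers": 0.5}    # assumed equal priors (illustrative)
hyps = {"powers of two": powers_of_two, "even numbers": even_numbers}

unnorm = {name: prior[name] * likelihood(D, h) for name, h in hyps.items()}
Z = sum(unnorm.values())
posterior = {name: v / Z for name, v in unnorm.items()}
print(posterior)   # "powers of two" gets most of the posterior mass: 1/6 vs. 1/50
```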

Prior and Posterior
X = {60, 80, 10, 30}. Why prefer "multiples of 10" over "even numbers"? Why prefer "multiples of 10" over "multiples of 10 except 50 and 20"? We cannot learn efficiently if we have a uniform prior over all 2^100 logically possible hypotheses.

Posterior Predictive Distribution
Bayesian model averaging: average the predictions of all hypotheses, weighted by their posterior probabilities. Alternatively, use the maximum a posteriori (MAP) hypothesis, i.e., a plug-in approximation.
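Written out in standard form, the two options for the posterior predictive distribution are:

```latex
% Bayesian model averaging:
p(\tilde{x} \in C \mid D) = \sum_{h \in H} p(\tilde{x} \in C \mid h)\, p(h \mid D)
% MAP plug-in approximation:
\hat{h} = \arg\max_{h} p(h \mid D), \qquad
p(\tilde{x} \in C \mid D) \approx p(\tilde{x} \in C \mid \hat{h})
```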

Naïve Bayes
Document classification example: Y in {1, ..., C}, x in {0, 1}^d, e.g., Y in {spam, urgent, normal}, and x_i = 1 if word i is present in the message. Apply Bayes' rule.

Class Conditional Density p(x | y = c)
Assumption: the features are conditionally independent given the class (e.g., the presence of words such as "assignment" and "released"). Options for the class conditional density include a multivariate Poisson model and a multinomial model; formally, the density factorizes over features.
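Written out, the conditional-independence (naïve Bayes) assumption factorizes the class conditional density over features:

```latex
p(x \mid y = c) = \prod_{j=1}^{d} p(x_j \mid y = c)
```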

Class Conditional Density p(x | y = c)
Binary features: multivariate Bernoulli. This is the commonly used choice.

Bayes' Rule: Class Prior
Let (Y_1, ..., Y_C) ~ Mult(pi, 1) be the class prior, with a 1-of-C encoding: only one bit can be on. E.g., p(spam) = 0.7, p(urgent) = 0.1, p(normal) = 0.2.

Bayes' Rule: Class Posterior
Fill in Bayes' rule with the class conditional probability and the prior.
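Filling Bayes' rule with the multivariate-Bernoulli class conditional density and the multinomial class prior pi gives the class posterior in the standard form:

```latex
p(y = c \mid x)
= \frac{\pi_c \prod_{j=1}^{d} \theta_{jc}^{\,x_j} (1 - \theta_{jc})^{1 - x_j}}
       {\sum_{c'=1}^{C} \pi_{c'} \prod_{j=1}^{d} \theta_{jc'}^{\,x_j} (1 - \theta_{jc'})^{1 - x_j}}
```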

Log-Sum-Exp Trick
The numerator and denominator of the class posterior are very small numbers, so use logs to avoid underflow. How do we compute the normalization constant? Use the log-sum-exp trick.

Parameter Estimation
So far we have assumed that the parameters of p(x | y = c) and p(y = c) are known. To estimate p(y = c), we can use MLE, MAP, or fully Bayesian estimation of a multinomial. To estimate p(x | y = c) with MLE for Bernoulli features: for each feature, count how many times word j occurred in documents of class c, and divide by the number of documents of class c.

Plug-in Approximation
We can compute MLEs for each feature j and each class c separately.
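A compact sketch of the pieces just described: MLE-style counts for Bernoulli features, log-space scoring, and the log-sum-exp trick for the normalization constant. The tiny toy corpus and the added Laplace smoothing constant are illustrative choices, not from the slides.

```python
import numpy as np

def logsumexp(a):
    # log sum_c exp(a_c), computed stably by shifting by the max.
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

# Toy binary word-presence data: rows are documents, columns are words (hypothetical).
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([0, 0, 1, 1])
C, d = 2, X.shape[1]

# Counts (with a small Laplace smoothing term) for p(y = c) and p(x_j = 1 | y = c):
pi = np.array([(y == c).mean() for c in range(C)])
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in range(C)])

def predict_log_posterior(x):
    # log p(y = c | x) = log pi_c + sum_j log p(x_j | y = c) - log Z
    log_joint = np.log(pi) + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return log_joint - logsumexp(log_joint)

print(np.exp(predict_log_posterior(np.array([1, 0, 0]))))
```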

Plug-in Approximation
Then we plug the estimated parameters into the class posterior.

Generative vs. Discriminative Models
Generative model: learn P(X, Y), with P(X, Y) = P(Y) P(X | Y); this specifies how to generate the observed features x for y. Discriminative model: learn P(Y | X); this directly models the mapping from features x to y. Easy to fit the model: generative model. Fit classes separately: generative model. Handle missing features easily: generative model. Handle unlabeled training data: easier for the generative model.

Symmetric in inputs and outputs (defines p(x, y)): generative model. Handle feature preprocessing: discriminative model. Well-calibrated probabilities: discriminative model.

Decision Tree
[Some of the slides are borrowed from Tom Mitchell's lecture.] Example: play tennis? Each internal node tests one attribute X_i. Each branch from a node selects one value for X_i. Each leaf node predicts Y (or P(Y | X in leaf)).

Top-Down Induction of Decision Trees
Which attribute is best?

Entropy
The entropy H(X) of a random variable X is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).

Information Gain
Gain(S, A) = the expected reduction in entropy due to sorting S on attribute A.
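A short sketch of the two quantities just defined, on a hypothetical "play tennis?" split (the toy labels are not from the slides): entropy is the negative sum of p log2 p, and information gain subtracts the size-weighted entropies of the subsets produced by splitting on an attribute.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """H(S) = -sum_v p_v * log2(p_v): expected bits to encode a label."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute_values):
    """Gain(S, A) = H(S) - sum_v (|S_v| / |S|) * H(S_v)."""
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Hypothetical "play tennis?" labels split by an attribute with values sunny/rain.
play = ["yes", "yes", "no", "no", "yes", "no"]
outlook = ["sunny", "sunny", "sunny", "rain", "rain", "rain"]
print(entropy(play), information_gain(play, outlook))
```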

Overfitting and How to Avoid It
Stop growing the tree when a data split is not statistically significant, or grow a full tree and then prune it (reduced-error pruning).

Rule Post-Pruning

What We Learned Today
Generative Model and Discriminative Model. Generative Models vs. Discriminative Models. Logistic Regression. Decision Tree.

Homework
Reading: Murphy Ch. 3, 8.1-8.3, 8.6, 16.2. The first assignment is out!