Machine Learning 4771

Machine Learning 4771 Instructor: Tony Jebara

Topic 7:
Unsupervised Learning, Statistical Perspective
Probability Models, Discrete & Continuous: Gaussian, Bernoulli, Multinomial
Maximum Likelihood, Logistic Regression
Conditioning, Marginalizing, Bayes Rule, Expectations
Classification, Regression, Detection
Dependence/Independence
Maximum Likelihood Naïve Bayes

Unsupervised Learning
Supervised: Classification, Regression f(x)=y
Unsupervised (can help supervised): Density/Structure Estimation, Clustering, Feature Selection, Anomaly Detection
(Figure: scatter plots of example data omitted)

Statistical Perspective
Several problems with the framework so far:
Only have input-output approaches (SVM, Neural Net)
Pulled non-linear squashing functions out of a hat
Pulled loss functions (squared error, etc.) out of a hat
Better approach for classification? What if we have multi-class classification?
What if we have other problems, i.e. unobserved values of x, y, etc.?
Also, what if we don't have a true function?
Example: the Projectile Cannon (c.f. Distal Learning). We would like to train a regression function to control a cannon's angle of fire (y) given a target distance (x).

Statistical Perspective
Example: the Projectile Cannon (the 45-degree problem).
x = input target distance, y = output cannon angle
x = \frac{v_0^2}{g} \sin(2y) + noise
For any reachable distance there are two valid firing angles (one below and one above 45 degrees). What does least squares do? It averages the two branches and predicts an angle near 45 degrees, which misses the target. Conditional statistical models address this problem.
(Figure: training data, output angle y versus input distance x)
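
A minimal sketch of this failure mode, assuming a hypothetical muzzle velocity, noise level, and sample size (none of these numbers come from the lecture): generate cannon data from the formula above and fit ordinary least squares, then note that the predicted angle sits between the two valid firing solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
v0, g = 10.0, 9.8                      # hypothetical muzzle velocity and gravity

# Sample firing angles on both sides of 45 degrees and compute the distance reached.
y = rng.uniform(0.1, np.pi / 2 - 0.1, size=500)                     # angle (radians)
x = (v0 ** 2 / g) * np.sin(2 * y) + rng.normal(0, 0.2, size=500)    # distance + noise

# Ordinary least squares fit of angle y from distance x (with a bias term).
X = np.column_stack([x, np.ones_like(x)])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# For a reachable distance, the fit returns roughly the average of the two
# valid angles (close to 45 degrees), which is not a correct firing solution.
print("predicted angle at x=5:", np.degrees(theta[0] * 5 + theta[1]))
```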

Probability Models
Instead of deterministic functions, the output is a probability.
Previously: our output was a scalar ŷ = f(x).
Now: our output is a probability, e.g. a probability bump p(y|x) centered at \theta^T x + b.
p(y|x) subsumes (is a superset of) ŷ = f(x). Why is this representation for our answer more general?
A deterministic answer with complete confidence is like putting a probability p(y|x) where all the mass is at ŷ: p(y|x) = \delta(y - ŷ).

Probability Models
Now: our output is a probability density function (pdf).
Probability Model: a family of pdf's with adjustable parameters which lets us select one of many p(y|x).
E.g. a 1-dim Gaussian distribution with mean parameter \mu:
p(y | \mu) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y - \mu)^2}
We want the mean centered on f(x)'s value, so linear regression becomes:
p(y | x, \Theta) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y - f(x))^2} = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y - \theta^T x - b)^2}

Probability Models
To fit to data, we typically maximize the likelihood of the probability model.
Log-likelihood = objective function (i.e. negative of cost) for probability models, which we want to maximize.
Define the (conditional) likelihood as L(\Theta) = \prod_i p(y_i | x_i) and the log-likelihood as l(\Theta) = \log L(\Theta) = \sum_i \log p(y_i | x_i).
For Gaussian p(y|x), maximum likelihood is least squares!
\sum_i \log p(y_i | x_i) = \sum_i \log \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y_i - f(x_i))^2} = \sum_i [ \log \frac{1}{\sqrt{2\pi}} - \frac{1}{2}(y_i - f(x_i))^2 ]
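
A small numerical check of this equivalence (the linear model and synthetic data below are made up for illustration): the Gaussian conditional log-likelihood and the negative half sum-of-squares differ only by a constant, so the same parameters optimize both.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=1.0, size=100)   # hypothetical data

def log_likelihood(theta, b):
    # Gaussian conditional model p(y|x) = N(y; theta*x + b, 1)
    resid = y - (theta * x + b)
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * resid ** 2)

def squared_error(theta, b):
    return np.sum((y - (theta * x + b)) ** 2)

# The two objectives differ by a constant, so the printed value is the same
# for every parameter setting.
for theta, b in [(1.5, 0.0), (2.0, 0.5), (2.5, 1.0)]:
    print(theta, b, log_likelihood(theta, b) + 0.5 * squared_error(theta, b))
```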

Probability Models
Can extend the probability model to 2 bumps:
p(y | x, \Theta) = \frac{1}{2} N(y - \mu_1) + \frac{1}{2} N(y - \mu_2)
where N(r) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} r^2} is the unit-variance Gaussian bump.
Each mean can be a linear regression function:
p(y | x, \Theta) = \frac{1}{2} N(y - f_1(x)) + \frac{1}{2} N(y - f_2(x)) = \frac{1}{2} N(y - \theta_1^T x - b_1) + \frac{1}{2} N(y - \theta_2^T x - b_2)
Therefore the (conditional) log-likelihood to maximize is:
l(\Theta) = \sum_i \log [ \frac{1}{2} N(y_i - \theta_1^T x_i - b_1) + \frac{1}{2} N(y_i - \theta_2^T x_i - b_2) ]
Maximize l(\Theta) using gradient ascent. This nicely handles the cannon firing data.
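
A minimal gradient-ascent sketch for this two-bump conditional model on 1-D inputs; the initialization, step size, and iteration count are arbitrary illustration choices, not values from the lecture.

```python
import numpy as np

def gaussian(r):                       # unit-variance Gaussian bump N(r)
    return np.exp(-0.5 * r ** 2) / np.sqrt(2 * np.pi)

def fit_two_bump(x, y, lr=1e-3, iters=5000):
    theta = np.array([0.1, -0.1])      # slopes of the two regression means
    b = np.array([0.0, 1.0])           # biases of the two regression means
    for _ in range(iters):
        resid = y[:, None] - (theta * x[:, None] + b)     # (N, 2) residuals
        bump = 0.5 * gaussian(resid)                      # (N, 2) weighted bumps
        resp = bump / bump.sum(axis=1, keepdims=True)     # responsibilities
        # Gradient of sum_i log(0.5 N_1 + 0.5 N_2) w.r.t. each mean's parameters.
        theta += lr * np.sum(resp * resid * x[:, None], axis=0)
        b += lr * np.sum(resp * resid, axis=0)
    return theta, b
```

Applied to the cannon data from the earlier sketch, the two means can track the lower and upper firing-angle branches instead of averaging them the way a single least-squares line does.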

Probability Models
Now classification: we can also go beyond deterministic!
Previously: we wanted the output to be binary. Now: our output is a probability, e.g. a probability table p(y|x) with p(y=0|x) = 0.73 and p(y=1|x) = 0.27.
This subsumes (is a superset of) the binary output again.
Consider a probability over binary events (coin flips!), e.g. the Bernoulli distribution (i.e. a 1x2 probability table) with parameter \alpha:
p(y | \alpha) = \alpha^y (1 - \alpha)^{1-y}, \alpha \in [0,1]
Linear classification can be done by setting \alpha equal to f(x):
p(y | x) = f(x)^y (1 - f(x))^{1-y}, f(x) \in [0,1]

Probability Models
Now linear classification is: p(y | x) = f(x)^y (1 - f(x))^{1-y}, f(x) \in [0,1]
The log-likelihood (negative of the cost function) is:
\sum_i \log p(y_i | x_i) = \sum_i [ y_i \log f(x_i) + (1 - y_i) \log(1 - f(x_i)) ] = \sum_{i \in class 1} \log f(x_i) + \sum_{i \in class 0} \log(1 - f(x_i))
But we need a squashing function so that f(x) lies in [0,1]: use the sigmoid or logistic again, f(x) = sigmoid(\theta^T x + b).
This is called logistic regression, a new loss function. Do gradient descent, similar to a logistic-output neural net!
Can also handle a multi-layer f(x) and do backprop again!
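
A compact sketch of logistic regression trained by gradient ascent on this log-likelihood; the learning rate, iteration count, and the two-blob toy data are arbitrary illustration choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Maximize sum_i [y_i log f(x_i) + (1-y_i) log(1-f(x_i))] by gradient ascent."""
    N, d = X.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(iters):
        f = sigmoid(X @ theta + b)          # predicted p(y=1|x)
        err = y - f                         # gradient of the log-likelihood w.r.t. the logits
        theta += lr * (X.T @ err) / N
        b += lr * err.mean()
    return theta, b

# Toy usage on two Gaussian blobs (hypothetical data).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
theta, b = fit_logistic(X, y)
print("training accuracy:", np.mean((sigmoid(X @ theta + b) > 0.5) == y))
```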

Generative Probability Models
Idea: extend probability to describe both X and Y. Find a probability density function over both: p(x,y).
E.g. describe the data with a multi-dimensional Gaussian (later).
Called a Generative Model because we can use it to synthesize or re-generate data similar to the training data we learned from.
Regression models & classification boundaries are not as flexible: they don't keep info about X and don't model noise/uncertainty.

Properties of PDFs
Let's review some basics of probability theory.
First, a pdf is a function with multiple inputs and one output: p(x_1, ..., x_n), e.g. p(x_1 = 0.3, ..., x_n = 1) = 0.2.
The function's output is always non-negative: p(x_1, ..., x_n) >= 0.
It can have discrete or continuous (or both) inputs: p(x_1 = 1, x_2 = 0, x_3 = 0, x_4 = 3.1415).
Summing over the domain of all inputs gives unity:
\int \int p(x,y) \, dx \, dy = 1 (continuous integral), \sum_x \sum_y p(x,y) = 1 (discrete sum)
E.g. a discrete 2x2 table p(x,y) with entries 0.4, 0.1, 0.3, 0.2 sums to 1.

Properties of PDFs
Marginalizing: integrating/summing out a variable leaves a marginal distribution over the remaining ones:
p(x) = \int p(x,y) \, dy
Conditioning: if a variable y is given, we get a conditional distribution over the remaining ones:
p(x | y) = \frac{p(x,y)}{p(y)}
Bayes Rule: mathematically just redo conditioning, but it has a deeper meaning (1764) if X is data and \theta is a model:
p(\theta | X) = \frac{p(X | \theta) \, p(\theta)}{p(X)}   (posterior = likelihood x prior / evidence)
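
A tiny discrete example (the 2x2 joint table below is made up, reusing the entries from the earlier slide) showing marginalizing, conditioning, and Bayes rule as array operations:

```python
import numpy as np

# Hypothetical joint table p(x, y): rows index x in {0,1}, columns index y in {0,1}.
p_xy = np.array([[0.4, 0.1],
                 [0.3, 0.2]])

p_x = p_xy.sum(axis=1)                  # marginalize out y: p(x)
p_y = p_xy.sum(axis=0)                  # marginalize out x: p(y)
p_x_given_y = p_xy / p_y                # conditioning: p(x|y) = p(x,y) / p(y)

# Bayes rule recovers p(x|y) from the "likelihood" p(y|x) and "prior" p(x).
p_y_given_x = p_xy / p_x[:, None]
bayes = p_y_given_x * p_x[:, None] / p_y
print(np.allclose(bayes, p_x_given_y))  # True
```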

Properties of PDFs
Expectation: we can use the pdf p(x) to compute averages and expected values of quantities, denoted E_{p(x)}{f(x)}:
E_{p(x)}{f(x)} = \int p(x) f(x) \, dx or \sum_x p(x) f(x)
Properties: E{c f(x)} = c E{f(x)}, E{f(x) + c} = E{f(x)} + c
Mean: expected value of x itself, E{x} = \int p(x) \, x \, dx
Example: speeding ticket. Fine = $0 with probability 0.8, Fine = $20 with probability 0.2. Expected cost of speeding? With f(x=0) = 0, f(x=1) = 20, p(x=0) = 0.8, p(x=1) = 0.2, we get E{f(x)} = 0(0.8) + 20(0.2) = $4.
Variance: expected value of (x - mean)^2, i.e. how much x varies:
Var{x} = E{(x - E{x})^2} = E{x^2 - 2 x E{x} + E{x}^2} = E{x^2} - 2 E{x} E{x} + E{x}^2 = E{x^2} - E{x}^2
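
The speeding-ticket expectation and the variance identity, checked numerically with the values from the slide:

```python
import numpy as np

values = np.array([0.0, 20.0])          # fine amounts in dollars
probs = np.array([0.8, 0.2])            # p(x=0), p(x=1)

mean = np.sum(probs * values)                            # E{f(x)} = 4.0
var_def = np.sum(probs * (values - mean) ** 2)           # E{(x - E{x})^2}
var_identity = np.sum(probs * values ** 2) - mean ** 2   # E{x^2} - E{x}^2
print(mean, var_def, var_identity)                       # 4.0 64.0 64.0
```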

Properties of PDFs
Covariance: how strongly x and y vary together:
Cov{x,y} = E{(x - E{x})(y - E{y})} = E{xy} - E{x} E{y}
Conditional Expectation: E{x | y} = \int p(x | y) \, x \, dx, E{f(x) | y} = \int p(x | y) f(x) \, dx
Sample Expectation: if we don't have the pdf p(x,y), we can approximate expectations using samples of data:
E{f(x)} \approx \frac{1}{N} \sum_i f(x_i)
Sample Mean: \frac{1}{N} \sum_i x_i
Sample Var: \frac{1}{N} \sum_i (x_i - \hat{E}{x})^2
Sample Cov: \frac{1}{N} \sum_i (x_i - \hat{E}{x})(y_i - \hat{E}{y})
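
A quick check, using synthetic correlated Gaussian samples chosen only for illustration, that the sample mean, variance, and covariance approximate the underlying expectations:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
x = rng.normal(loc=1.0, scale=2.0, size=N)
y = 0.5 * x + rng.normal(size=N)        # y correlated with x

sample_mean = x.mean()                                   # approx E{x} = 1
sample_var = np.mean((x - sample_mean) ** 2)             # approx Var{x} = 4
sample_cov = np.mean((x - x.mean()) * (y - y.mean()))    # approx Cov{x,y} = 0.5 * 4 = 2
print(sample_mean, sample_var, sample_cov)
```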

More Properties of PDFs
Independence: probabilities of independent variables multiply. Denote it with the following notation:
x \perp y  <=>  p(x,y) = p(x) p(y), and also p(x | y) = p(x)
Also note that in this case:
E{xy} = \int \int p(x) p(y) \, x y \, dx \, dy = \int p(x) \, x \, dx \int p(y) \, y \, dy = E_{p(x)}{x} E_{p(y)}{y}
Conditional independence: two variables become independent only if another is observed:
x \perp y | z  <=>  p(x,y | z) = p(x | z) p(y | z), and also p(x | y, z) = p(x | z)

The IID Assumption
Most of the time, we will assume that a dataset is independent and identically distributed (IID).
In many real situations, data is generated by some black box phenomenon in an arbitrary order.
Assume we are given a dataset X = {x_1, ..., x_N}.
Independent means that (given the model \Theta) the probability of our data multiplies:
p(x_1, ..., x_N | \Theta) = \prod_i p_i(x_i | \Theta)
Identically distributed means that each marginal probability is the same for each data point:
p(x_1, ..., x_N | \Theta) = \prod_i p_i(x_i | \Theta) = \prod_i p(x_i | \Theta)

The IID Assumption
Bayes rule says the likelihood is the probability of the data given the model:
p(\Theta | X) = \frac{p(X | \Theta) \, p(\Theta)}{p(X)}   (posterior = likelihood x prior / evidence)
The likelihood of X = {x_1, ..., x_N} under the IID assumption is:
p(X | \Theta) = p(x_1, ..., x_N | \Theta) = \prod_i p(x_i | \Theta)
Learn a joint distribution by maximum likelihood:
\Theta^* = \arg\max_\Theta \prod_i p(x_i | \Theta) = \arg\max_\Theta \sum_i \log p(x_i | \Theta)
Learn a conditional by maximum conditional likelihood:
\Theta^* = \arg\max_\Theta \prod_i p(y_i | x_i, \Theta) = \arg\max_\Theta \sum_i \log p(y_i | x_i, \Theta)
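
A small illustration of IID maximum likelihood, using 1-D unit-variance Gaussian samples as a stand-in for p(x | \Theta) (the generating mean and grid are made up): the log-likelihood is a sum over points and is maximized near the sample mean.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=2.0, scale=1.0, size=500)   # IID samples from p(x | theta)

def iid_log_likelihood(mu):
    # log prod_i p(x_i | mu) = sum_i log N(x_i; mu, 1)
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (X - mu) ** 2)

grid = np.linspace(0.0, 4.0, 401)
best = grid[np.argmax([iid_log_likelihood(m) for m in grid])]
print(best, X.mean())     # the grid maximizer closely matches the sample mean
```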

Uses of PDFs
Classification: we have p(x,y) and are given x. Asked for a discrete y output, give the most probable one:
\hat{y} = \arg\max_m p(y = m | x)
Regression: we have p(x,y) and are given x. Asked for a scalar y output, give the most probable or the expected one:
\hat{y} = \arg\max_y p(y | x) or \hat{y} = E_{p(y|x)}{y}
Anomaly Detection: we have p(x,y) and are given both x and y. Asked if the pair is similar to the training data, threshold the density:
p(x,y) >= threshold -> normal, p(x,y) < threshold -> anomaly
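
One compact sketch of all three uses of a pdf p(x,y) on an invented discrete joint table: pick the most probable class, take a conditional expectation, and threshold the joint density (the table values and threshold are hypothetical).

```python
import numpy as np

# Hypothetical joint table p(x, y): rows index x in {0,1,2}, columns index y in {0,1}.
p_xy = np.array([[0.30, 0.05],
                 [0.10, 0.25],
                 [0.05, 0.25]])

x = 1                                         # observed input
p_y_given_x = p_xy[x] / p_xy[x].sum()

# Classification: most probable discrete y.
y_hat = np.argmax(p_y_given_x)

# Regression: expected y under p(y|x) (here y takes the values 0 and 1).
y_expected = np.sum(p_y_given_x * np.arange(p_xy.shape[1]))

# Anomaly detection: flag (x, y) pairs whose joint probability is below a threshold.
is_anomaly = p_xy[2, 0] < 0.08

print(y_hat, y_expected, is_anomaly)          # 1, ~0.71, True
```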