Decision-making, inference, and learning theory. ECE 830 & CS 761, Spring 2016


Decision-making, inference, and learning theory ECE 830 & CS 761, Spring 2016 1 / 22

What do we have here?
Given measurements or observations of some physical process, we ask the simple question: what do we have here? For instance:
- What is the value of the underlying model parameters?
- Do my measurements fall in category A or B?
- Can I predict what measurements I will collect under different conditions?
- Is there any information in my measurements, or are they just noise?
2 / 22

A major difficulty
In many machine learning (ML) and signal processing (SP) applications, we don't have complete or perfect knowledge of the signals we wish to process. We are faced with many unknowns and uncertainties. Examples:
- Unknown signal parameters (delay of a radar return, pitch of a speech signal)
- Environmental noise (multipath signals in wireless communications, ambient electromagnetic waves)
- Sensor noise (grainy images, old phonograph recordings)
- Missing data (dropped packets, occlusions, partial labels)
- Variability inherent in nature (the stock market, the internet)
Statistical signal processing and machine learning address how to process data in the face of such uncertainty. 3 / 22

Components of SP and ML: Modeling, Measurement, and Inference
Step 1: Postulate a probability model (or collection of models) that can be expected to reasonably capture the uncertainties in the data.
Step 2: Collect data.
Step 3: Formulate statistics that allow us to interpret or understand our probability models.
4 / 22

Modeling uncertainty
There are many ways to model these sorts of uncertainties. In this course we will model them probabilistically. Let p(x | θ) denote a probability distribution parameterized by θ. The parameter θ could represent characteristics of errors or noise in the measurement process, or govern inherent variability in the signal itself.
Example: Univariate Gaussian model
If x is a scalar measurement, then we could have the Gaussian probabilistic model
p(x | θ) = (1/√(2π)) exp(-(x - θ)²/2),
a model which says that x is typically close to the value of θ and rarely very different. 5 / 22
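
To make the model concrete, here is a minimal NumPy sketch (ours, not from the slides) that evaluates this unit-variance Gaussian density; the function name and the example values of θ and x are chosen purely for illustration.

```python
import numpy as np

def gaussian_likelihood(x, theta):
    """Evaluate p(x | theta) for the unit-variance Gaussian model above."""
    return np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi)

# A measurement close to theta is much more probable than one far away.
theta = 1.0
print(gaussian_likelihood(1.1, theta))  # roughly 0.397
print(gaussian_likelihood(4.0, theta))  # roughly 0.004
```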

Classes of probabilistic models
- Fixed: p(x | θ) is fully known. E.g., bit errors in a communication system are modeled using Bernoulli random variables with a known bit error rate, or sensor noise is modeled as an additive Gaussian random variable.
- Parametric: p(x | θ) has a known form, but there are some unknown parameters. E.g., the uncertainty in the number of photons striking a CCD per unit time is modeled as a Poisson random variable, but the average photon arrival rate is an unknown parameter.
- Nonparametric: p(x | θ) belongs to a very large, rich family of models. E.g., people's political leanings have a complex dependency on their family backgrounds, education, socio-economic status, habitat, age, ...; this distribution cannot be expressed using a simple parametric model.
- Distribution free: we avoid any explicit modeling assumptions on p(x | θ) and rely on weaker assumptions, e.g., that observations are independent and bounded or subgaussian.
6 / 22
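
The parametric case can be illustrated with a short sketch (our own example, assuming NumPy): the photon counts follow a Poisson model of known form, and the single unknown parameter, the average arrival rate, is estimated by the sample mean, which is the maximum-likelihood estimate of a Poisson rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parametric setting: photon counts are Poisson, but the rate is unknown.
true_rate = 7.3                      # hidden from the analyst in practice
counts = rng.poisson(true_rate, size=1000)

# The maximum-likelihood estimate of a Poisson rate is the sample mean.
estimated_rate = counts.mean()
print(estimated_rate)                # should land close to 7.3
```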

Networks of interacting neurons Probabilistic models capture the dependencies among neuron firing times. 8 / 22

Three main problems
There are three fundamental inference problems in machine learning and statistical signal processing that will be the focus of this course:
1. Decision-making, testing, and detection
2. Regression and estimation
3. Learning and prediction
9 / 22

Problem 1: Decision-making, testing, and detection
Example: Decision-making
Suppose that θ takes one of two possible values, so that either p(x | θ_1) or p(x | θ_2) fits the data x best. Then we need to decide whether p(x | θ_1) is a better model than p(x | θ_2). More generally, θ may be one of a finite number of values {θ_1, ..., θ_M} and we must decide among the M models. 10 / 22
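
A minimal sketch of such a decision rule (ours, not the slides'): assume for concreteness that both candidate models are unit-variance Gaussians with means θ_1 and θ_2, so comparing likelihoods reduces to comparing squared distances to the data.

```python
import numpy as np

def decide(x, theta1, theta2):
    """Return 1 or 2 according to which unit-variance Gaussian mean better explains x."""
    loglik1 = -0.5 * np.sum((x - theta1) ** 2)   # log-likelihood up to a constant
    loglik2 = -0.5 * np.sum((x - theta2) ** 2)
    return 1 if loglik1 >= loglik2 else 2

rng = np.random.default_rng(1)
x = 0.0 + rng.normal(size=50)                    # data actually generated with theta = 0
print(decide(x, theta1=0.0, theta2=1.0))         # prints 1 with high probability
```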

A Decision-Making Example
Consider a binary communication system. Let s = [s_1, ..., s_n] denote a digitized waveform. A transmitter communicates a bit of information by sending s or -s (for 1 or 0, respectively). The receiver measures a noisy version of the transmitted signal.
[Figure: the noisy received data plotted against the sample index i, together with the waveforms s and -s.]
Our task is to (a) model the data as a function of s and (b) use that model to determine whether s or -s was transmitted. 11 / 22
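
One natural way to carry out step (b), sketched below under the assumption of additive Gaussian noise, is a correlation (matched-filter) receiver that decides 1 when the inner product of the received data with s is positive and 0 otherwise. The waveform s and the noise level here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
i = np.linspace(0, 1, n)
s = np.sin(2 * np.pi * 3 * i)          # a made-up digitized waveform s

bit = 1                                # transmitter sends s for 1, -s for 0
x = (s if bit == 1 else -s) + 0.5 * rng.normal(size=n)   # noisy received signal

# Correlation (matched-filter) receiver: decide 1 if <x, s> > 0, else 0.
decision = int(x @ s > 0)
print(decision)                        # 1 with high probability at this noise level
```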

Problem 2: Regression and estimation of θ Suppose that θ belongs to an infinite set. Then we must decide or choose among an infinite number of models. In this sense, estimation may be viewed as an extension of detection to infinite model classes. This extension presents many new challenges and issues and so it is given its own name. 12 / 22
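
Continuing the univariate Gaussian model from above, here is a tiny sketch (ours) of estimation over an infinite model class: the likelihood is maximized over all real θ, and the maximizer is simply the sample mean. The true θ and sample size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

theta_true = 2.5                       # theta now ranges over the whole real line
x = theta_true + rng.normal(size=200)  # 200 noisy measurements

# For the unit-variance Gaussian model, the likelihood is maximized over all
# real theta by the sample mean.
theta_hat = x.mean()
print(theta_hat)                       # close to 2.5
```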

A Parameter Estimation Example Example: Radar Range Estimation 13 / 22

A Nonparametric Estimation Example
Example: Image restoration
Imagine that you are collaborating with biologists who are interested in imaging biological systems using a new type of microscopy. The imaging system doesn't produce perfect images: the data collected is distorted and noisy. As a signal processing expert, you are asked to develop an image processing algorithm to restore the image.
http://www.nature.com/srep/2013/130828/srep02523/full/srep02523.html 14 / 22

Example: Image restoration (cont.)
Our task is to (a) form a probabilistic model p(x | θ) and (b) estimate θ to restore the image.
Let us assume that the distortion is a linear operation. Then we can model the collected data by the equation
x = Hθ + w,
where
- θ is the ideal image we wish to recover (represented as a vector, each element of which is a pixel),
- H is a known model of the distortion (represented as a matrix), and
- w is a vector of noise.
15 / 22
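
As a rough sketch of one possible estimator for this linear model (Tikhonov-regularized least squares; not necessarily the method developed later in the course), with toy sizes and an invented distortion matrix, noise level, and regularization weight:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy sizes; a real image would be vectorized into a much longer theta.
n_pix = 64
H = 0.8 * np.eye(n_pix) + 0.1 * rng.normal(size=(n_pix, n_pix))  # assumed-known distortion
theta = rng.uniform(size=n_pix)                                   # ideal image (unknown)
x = H @ theta + 0.05 * rng.normal(size=n_pix)                     # distorted, noisy data

# Regularized least squares: theta_hat = argmin ||x - H theta||^2 + lam ||theta||^2,
# which has the closed form below.
lam = 0.1
theta_hat = np.linalg.solve(H.T @ H + lam * np.eye(n_pix), H.T @ x)
print(np.linalg.norm(theta_hat - theta) / np.linalg.norm(theta))  # relative error
```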

Problem 3: Learning and prediction
In many problems we wish to predict the value of a label y given an observation of related data x. The conditional distribution of y given x is denoted by p(y | x) (or p(y | x; θ)), and the prediction problem can then be viewed as determining a value of y that is highly probable given x. Sometimes we don't know a good model of the relationship between x and y, but we do have a number of training examples, say {(x_i, y_i)}, i = 1, ..., n, that give us some indication of the relationship. The goal of learning is to design a good prediction rule for y given x using these examples, instead of p(y | x). 16 / 22
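
One of the simplest prediction rules that can be built directly from training examples is a one-nearest-neighbor rule; the sketch below (our illustration, with synthetic data) predicts the label of the closest training point.

```python
import numpy as np

def predict_nn(x_new, X_train, y_train):
    """Predict y for x_new from its single nearest training example."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(dists)]

rng = np.random.default_rng(5)
X_train = rng.normal(size=(100, 3))
y_train = (X_train[:, 0] > 0).astype(int)        # synthetic labels for illustration

print(predict_nn(np.array([1.0, 0.0, 0.0]), X_train, y_train))   # most likely 1
```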

A Learning Example
Example: 21 and me
Now imagine you are working with geneticists to develop a diagnostic tool to predict whether patients have a certain disease. The tool is to be based on genomic data from the patient. For example, suppose that a microarray experiment is used to measure the levels of gene expression in the patient. For each of m genes we have an expression level (which reflects the amount of protein that gene is producing). Let x denote an m × 1 vector of the expression levels and let y denote a binary variable indicating whether or not the patient has the disease. 17 / 22

A Learning Example (cont.)
Example: 21 and me (cont.)
Some predictions are more difficult than others...
http://imgs.xkcd.com/comics/genetic_analysis.png 18 / 22

The Netflix problem 19 / 22

The Netflix prize 20 / 22

Example: Predicting Netflix ratings
Here x contains the measured movie ratings and y contains the unknown movie ratings we wish to predict. 21 / 22

Example: Predicting Netflix ratings (cont.) One probabilistic model says the underlying matrix of true ratings can be factored into the product of two smaller matrices. 22 / 22
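
A toy sketch of this low-rank idea (our own example, not the slides' method or the actual Netflix data): fit two small factors to the observed entries of a synthetic ratings matrix by gradient descent, then predict the unobserved entries with their product.

```python
import numpy as np

rng = np.random.default_rng(6)

n_users, n_movies, k = 50, 40, 3
# Synthetic rank-k "true" ratings and a mask marking which entries were observed.
R_true = rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_movies))
observed = rng.uniform(size=R_true.shape) < 0.3

U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(k, n_movies))

# Gradient descent on the squared error over the observed entries only.
lr = 0.01
for _ in range(2000):
    err = observed * (U @ V - R_true)   # residual on observed entries
    U_grad = err @ V.T
    V_grad = U.T @ err
    U -= lr * U_grad
    V -= lr * V_grad

# Unobserved ratings are predicted by the completed low-rank matrix U @ V.
rmse_heldout = np.sqrt(np.mean((U @ V - R_true)[~observed] ** 2))
print(rmse_heldout)
```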
