Decision-making, inference, and learning theory
ECE 830 & CS 761, Spring 2016
What do we have here?
Given measurements or observations of some physical process, we ask the simple question: what do we have here? For instance:
- What are the values of the underlying model parameters?
- Do my measurements fall in category A or B?
- Can I predict what measurements I will collect under different conditions?
- Is there any information in my measurements, or are they just noise?
A major difficulty
In many machine learning (ML) and signal processing (SP) applications, we don't have complete or perfect knowledge of the signals we wish to process. We are faced with many unknowns and uncertainties. Examples:
- Unknown signal parameters (delay of a radar return, pitch of a speech signal)
- Environmental noise (multipath signals in wireless communications, ambient electromagnetic waves)
- Sensor noise (grainy images, old phonograph recordings)
- Missing data (dropped packets, occlusions, partial labels)
- Variability inherent in nature (the stock market, the internet)
Statistical signal processing and machine learning address how to process data in the face of such uncertainty.
Components of SP and ML: Modeling, Measurement, and Inference
Step 1: Postulate a probability model (or collection of models) that can be expected to reasonably capture the uncertainties in the data.
Step 2: Collect data.
Step 3: Formulate statistics that allow us to interpret or understand our probability models.
Modeling uncertainty
There are many ways to model these sorts of uncertainties. In this course we will model them probabilistically. Let $p(x \mid \theta)$ denote a probability distribution parameterized by $\theta$. The parameter $\theta$ could represent characteristics of errors or noise in the measurement process, or govern inherent variability in the signal itself.
Example: Univariate Gaussian model
If $x$ is a scalar measurement, then we could use the Gaussian probabilistic model
$$p(x \mid \theta) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(x-\theta)^2}{2}\right),$$
a model which says that $x$ is typically close to the value of $\theta$ and rarely very different.
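As a quick numerical illustration of this model (a minimal sketch; the function name and values are ours, not from the lecture):

```python
import numpy as np

def gaussian_likelihood(x, theta):
    """Evaluate p(x | theta) = (1/sqrt(2*pi)) * exp(-(x - theta)^2 / 2),
    the unit-variance Gaussian model above."""
    return np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi)

theta = 1.0
# A measurement near theta is far more probable than a distant one.
print(gaussian_likelihood(1.1, theta))  # ~0.397
print(gaussian_likelihood(4.0, theta))  # ~0.004
```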
Classes of probabilistic models
- fixed: $p(x \mid \theta)$ is fully known. E.g., bit errors in a communication system are modeled using Bernoulli random variables with a known bit error rate, or sensor noise is modeled as an additive Gaussian random variable.
- parametric: $p(x \mid \theta)$ has a known form, but there are some unknown parameters. E.g., uncertainty in the number of photons striking a CCD per unit time is modeled as a Poisson random variable, but the average photon arrival rate is an unknown parameter (see the sketch after this list).
- nonparametric: $p(x \mid \theta)$ belongs to a very large, rich family of models. E.g., people's political leanings have a complex dependency on their family backgrounds, education, socio-economic status, habitat, age, ...; this distribution cannot be expressed using a simple parametric model.
- distribution free: we avoid any explicit modeling assumptions on $p(x \mid \theta)$ and rely on weaker assumptions, e.g., that observations are independent and bounded, or subgaussian.
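Here is a minimal sketch of the parametric case above: the model family (Poisson) is known, and only the arrival rate must be estimated. The numbers are made up for illustration; for a Poisson model the maximum-likelihood rate estimate is simply the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parametric model: photon counts per unit time are Poisson(rate),
# with the average arrival rate unknown (the CCD example above).
true_rate = 7.3
counts = rng.poisson(true_rate, size=1000)  # simulated measurements

# Maximum-likelihood estimate of a Poisson rate: the sample mean.
rate_hat = counts.mean()
print(rate_hat)  # close to 7.3
```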
Networks of interacting neurons
Probabilistic models capture the dependencies among neuron firing times.
Three main problems
There are three fundamental inference problems in machine learning and statistical signal processing that will be the focus of this course:
1. Decision-making, testing, and detection
2. Regression and estimation
3. Learning and prediction
Problem 1: Decision-making, testing, and detection
Example: Decision-making
Suppose that $\theta$ takes one of two possible values, so that either $p(x \mid \theta_1)$ or $p(x \mid \theta_2)$ fits the data $x$ best. Then we need to decide whether $p(x \mid \theta_1)$ is a better model than $p(x \mid \theta_2)$. More generally, $\theta$ may be one of a finite number of values $\{\theta_1, \ldots, \theta_M\}$, and we must decide among the $M$ models.
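In the two-model case, the standard rule (the likelihood ratio test of detection theory, stated here for concreteness; it is not spelled out on this slide) compares the two likelihoods:
$$\Lambda(x) = \frac{p(x \mid \theta_1)}{p(x \mid \theta_2)}, \qquad \text{decide } \theta_1 \text{ if } \Lambda(x) > 1, \text{ and } \theta_2 \text{ otherwise.}$$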
A Decision-Making Example
Consider a binary communication system. Let $s = [s_1, \ldots, s_n]$ denote a digitized waveform. A transmitter communicates a bit of information by sending $s$ or $-s$ (for 1 or 0, respectively). The receiver measures a noisy version of the transmitted signal.
[Figure: noisy received data plotted against the waveforms $s$ and $-s$.]
Our task is to (a) model the data as a function of $s$ and (b) use that model to determine whether $s$ or $-s$ was transmitted.
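Under an i.i.d. Gaussian noise model, deciding between $s$ and $-s$ reduces to checking the sign of the correlation $\langle x, s \rangle$, since $\|x - s\|^2 < \|x + s\|^2$ exactly when $x^\top s > 0$. A minimal sketch, with an illustrative waveform and noise level of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Known waveform s and a noisy measurement x = (+/-) s + w.
n = 100
s = np.sin(2 * np.pi * 4 * np.arange(n) / n)  # stand-in waveform
x = -s + 0.5 * rng.standard_normal(n)         # "-s" (bit 0) was sent

# Comparing the Gaussian likelihoods of +s and -s reduces to the
# sign of the correlation between the data and the known waveform.
bit = 1 if x @ s > 0 else 0
print(bit)  # 0
```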
Problem 2: Regression and estimation of $\theta$
Suppose that $\theta$ belongs to an infinite set. Then we must decide or choose among an infinite number of models. In this sense, estimation may be viewed as an extension of detection to infinite model classes. This extension presents many new challenges and issues, and so it is given its own name.
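The most common way to choose among infinitely many models (a standard formulation, added here for concreteness) is maximum-likelihood estimation:
$$\hat{\theta} = \arg\max_{\theta} \, p(x \mid \theta).$$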
A Parameter Estimation Example
Example: Radar Range Estimation
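For reference, the parameter in this example is the round-trip delay $\tau$ of the radar return; the standard relation (not spelled out on the slide) converts a delay estimate into a range estimate:
$$\hat{R} = \frac{c\,\hat{\tau}}{2},$$
where $c$ is the speed of light and the factor of 2 accounts for the round trip.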
A Nonparametric Estimation Example
Example: Image restoration
Imagine that you are collaborating with biologists who are interested in imaging biological systems using a new type of microscopy. The imaging system doesn't produce perfect images: the data collected are distorted and noisy. As a signal processing expert, you are asked to develop an image processing algorithm to restore the image.
http://www.nature.com/srep/2013/130828/srep02523/full/srep02523.html
Example: Image restoration (cont.)
Our task is to (a) form a probabilistic model $p(x \mid \theta)$ and (b) estimate $\theta$ to restore the image.
Let us assume that the distortion is a linear operation. Then we can model the collected data as
$$x = H\theta + w,$$
where
- $\theta$ is the ideal image we wish to recover (represented as a vector, each element of which is a pixel),
- $H$ is a known model of the distortion (represented as a matrix), and
- $w$ is a vector of noise.
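A 1-D toy version of this model, restored by regularized least squares, $\hat{\theta} = (H^\top H + \lambda I)^{-1} H^\top x$. The blur kernel, noise level, and regularization weight are all illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy instance of x = H theta + w with a known 1-D blur matrix H.
n = 50
theta = np.zeros(n)
theta[20:30] = 1.0                                 # "ideal image"
H = 0.6 * np.eye(n) + 0.2 * (np.eye(n, k=1) + np.eye(n, k=-1))
x = H @ theta + 0.01 * rng.standard_normal(n)      # distorted + noisy

# Regularized least-squares (Tikhonov/ridge) estimate of theta.
lam = 1e-3
theta_hat = np.linalg.solve(H.T @ H + lam * np.eye(n), H.T @ x)
print(np.linalg.norm(theta_hat - theta) / np.linalg.norm(theta))
# relative restoration error; small for this well-conditioned blur
```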
Problem 3: Learning and prediction
In many problems we wish to predict the value of a label $y$ given an observation of related data $x$. The conditional distribution of $y$ given $x$ is denoted by $p(y \mid x)$ (or $p(y \mid x; \theta)$), and the prediction problem can then be viewed as determining a value of $y$ that is highly probable given $x$. Sometimes we don't have a good model of the relationship between $x$ and $y$, but we do have a number of training examples, say $\{x_i, y_i\}_{i=1}^n$, that give us some indication of the relationship. The goal of learning is to design a good prediction rule for $y$ given $x$ using these examples, instead of $p(y \mid x)$.
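One of the simplest such prediction rules is the nearest-neighbor rule, which uses the training examples directly and never forms $p(y \mid x)$. A minimal sketch (the function and data are ours, for illustration):

```python
import numpy as np

def predict_1nn(x, X_train, y_train):
    """Predict a label for x by copying the label of the closest
    training example -- a rule built from {(x_i, y_i)} alone."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
y_train = np.array([0, 1, 1])
print(predict_1nn(np.array([0.9, 0.8]), X_train, y_train))  # 1
```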
A Learning Example
Example: 21 and me
Now imagine you are working with geneticists to develop a diagnostic tool to predict whether patients have a certain disease. The tool is to be based on genomic data from the patient. For example, suppose that a microarray experiment is used to measure the levels of gene expression in the patient. For each of $m$ genes we have an expression level (which reflects the amount of protein that gene is producing). Let $x$ denote an $m \times 1$ vector of the expression levels, and let $y$ denote a binary variable indicating whether or not the patient has the disease.
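A common way to learn such a diagnostic rule is a linear classifier fit to labeled examples. A sketch with synthetic stand-in data (the dimensions, data-generating process, and learning rate are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the microarray setting: m expression
# levels per patient, binary disease label. (Not real data.)
m, n = 20, 200
w_true = rng.standard_normal(m)
X = rng.standard_normal((n, m))                      # expression levels
y = (X @ w_true + 0.5 * rng.standard_normal(n)) > 0  # disease labels

# Logistic regression by gradient descent: learn a linear rule
# from examples rather than from a known p(y | x).
w = np.zeros(m)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted P(y = 1 | x)
    w -= 0.1 * X.T @ (p - y) / n        # gradient of the logistic loss
print(np.mean((X @ w > 0) == y))        # training accuracy, high
```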
A Learning Example (cont.)
Example: 21 and me (cont.)
Some predictions are more difficult than others...
http://imgs.xkcd.com/comics/genetic_analysis.png
The Netflix problem
The Netflix prize
Example: Predicting Netflix ratings
Here $x$ contains the measured movie ratings, and $y$ contains the unknown movie ratings we wish to predict.
Example: Predicting Netflix ratings (cont.)
One probabilistic model says the underlying matrix of true ratings can be factored into the product of two smaller matrices.
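A toy version of this low-rank model, completed by gradient descent on the observed entries (the sizes, rank, step size, and iteration count are illustrative; the actual Netflix-prize systems were far more elaborate):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy ratings matrix of rank r: (users x movies) = U V^T.
n_users, n_movies, r = 30, 20, 3
R = rng.standard_normal((n_users, r)) @ rng.standard_normal((n_movies, r)).T
mask = rng.random(R.shape) < 0.5          # which ratings are observed

# Fit two small factors by gradient descent on observed entries only.
A = 0.1 * rng.standard_normal((n_users, r))
B = 0.1 * rng.standard_normal((n_movies, r))
for _ in range(2000):
    E = mask * (A @ B.T - R)              # error on observed entries
    A, B = A - 0.01 * E @ B, B - 0.01 * E.T @ A
# The completed matrix A B^T predicts the unobserved ratings.
print(np.abs((A @ B.T - R)[~mask]).mean())  # small if the model fits
```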