Gaussian Models


Gaussian Models ddebarr@uw.edu 2016-04-28

Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters

Introduction Gaussian Density Function MultiVariate Normal (MVN) Probability Density Function (PDF)
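For reference (the slide's formula is not reproduced in this transcription), the MVN probability density function for a D-dimensional vector x with mean μ and covariance Σ has the standard form:

```latex
\mathcal{N}(x \mid \mu, \Sigma) \;=\;
\frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}
\exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)
```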

Introduction Visualization of a 2D Gaussian Density Based on spectral decomposition of the covariance matrix
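A minimal sketch of the idea, using an illustrative covariance matrix (the slide's actual values are not shown): the eigendecomposition Σ = U diag(λ) Uᵀ gives the orientation (eigenvectors) and squared axis lengths (eigenvalues) of the elliptical contours.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative 2D mean and covariance (not the lecture's values)
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

# Spectral decomposition: Sigma = U diag(eigvals) U^T
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Map the unit circle to the 1-standard-deviation contour of the Gaussian
theta = np.linspace(0.0, 2.0 * np.pi, 200)
circle = np.stack([np.cos(theta), np.sin(theta)])
ellipse = mu[:, None] + eigvecs @ np.diag(np.sqrt(eigvals)) @ circle

plt.plot(ellipse[0], ellipse[1])
plt.axis("equal")
plt.title("1-sigma contour of a 2D Gaussian")
plt.show()
```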

Introduction Maximum Likelihood Estimate for Parameters Reminder for reading: To maximize a function f(x), we solve f′(x) = 0 for the value of x
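The MLEs themselves are the sample mean and the (biased, divide-by-N) sample covariance; a small sketch:

```python
import numpy as np

def mvn_mle(X):
    """Maximum likelihood estimates for an MVN fit to an (N, D) data matrix X."""
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / X.shape[0]   # divide by N, not N - 1
    return mu_hat, Sigma_hat
```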

Introduction Gaussian Functions Fitted to Data

Introduction Maximum Entropy Interpretation of Gaussian The multivariate Gaussian is the distribution with maximum entropy, subject to the constraints that it has a specified mean and covariance. Maximum entropy means fewer assumptions.

Gaussian Discriminant Analysis Gaussian Discriminant Analysis A discriminant is a function that can be used to distinguish between members of different classes. Gaussian discriminant analysis uses the Gaussian distribution for the class-conditional density function.

Gaussian Discriminant Analysis Quadratic Discriminant Analysis (QDA) Called quadratic because it uses squared terms. The decision boundary may not be linear [i.e. it may not be a line in 2-dimensional space (or a hyperplane in n-dimensional space)].
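A minimal sketch of the per-class QDA score (log prior plus class-conditional log density, up to a constant) whose argmax gives the predicted class; the quadratic form in x is what makes the boundary curved:

```python
import numpy as np

def qda_score(x, pi_c, mu_c, Sigma_c):
    """Log discriminant for one class: log prior + log N(x | mu_c, Sigma_c), up to a constant."""
    diff = x - mu_c
    _, logdet = np.linalg.slogdet(Sigma_c)
    return (np.log(pi_c)
            - 0.5 * logdet
            - 0.5 * diff @ np.linalg.solve(Sigma_c, diff))
```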

Gaussian Discriminant Analysis Examples of QDA Decision Boundaries

Gaussian Discriminant Analysis Linear Discriminant Analysis (LDA) The same model as QDA, except all classes use the same covariance matrix. The decision boundary is linear: the quadratic term cancels out because it becomes independent of the class. Note: there are two separate expansions for the LDA acronym: Linear Discriminant Analysis is used for classification; Latent Dirichlet Allocation is used for topic modeling (for text).
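With a shared Σ, expanding the quadratic form makes the xᵀΣ⁻¹x term identical for every class, so it drops out of the comparison and each class score is linear in x:

```latex
\delta_c(x) \;=\; x^{\top}\Sigma^{-1}\mu_c \;-\; \tfrac{1}{2}\,\mu_c^{\top}\Sigma^{-1}\mu_c \;+\; \log \pi_c
```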

Gaussian Discriminant Analysis Linear Discriminant Analysis Softmax function

Gaussian Discriminant Analysis Softmax Distribution Note: when normalized by a temperature, S(η/T), a winner emerges as the temperature is reduced. T is a constant called the temperature; the example uses η = (3, 0, 1).
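A small sketch of the softmax S(η/T) for the slide's example η = (3, 0, 1), showing how lowering the temperature concentrates the probability mass on the largest entry:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

eta = np.array([3.0, 0.0, 1.0])
for T in [100.0, 2.0, 1.0, 0.1]:
    print(f"T = {T:>5}: {softmax(eta / T).round(3)}")
# As T decreases, the distribution approaches a point mass on the winning entry.
```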

Gaussian Discriminant Analysis Examples of LDA Decision Boundaries

Gaussian Discriminant Analysis The Geometry of 2 Class LDA

Gaussian Discriminant Analysis Strategies to Avoid Overfitting Use a diagonal covariance matrix for each class [naïve Bayes]; use the same covariance matrix for all classes [LDA]; use a full covariance matrix, but impose a prior and then integrate it out [analogous to Bayesian naïve Bayes]; fit the covariance matrix using MAP estimation; project the data to a lower-dimensional subspace and fit the Gaussians there.

Gaussian Discriminant Analysis Regularized LDA MAP estimation of the covariance matrix. A larger value of lambda means reduced covariance (off-diagonal entries shifting towards zero).
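A one-line sketch in the spirit of the slide: blend the MLE covariance with its own diagonal, so a larger λ shrinks the off-diagonal entries toward zero.

```python
import numpy as np

def regularized_cov(Sigma_mle, lam):
    """Shrink off-diagonal covariance entries toward zero; lam = 1 keeps only the diagonal."""
    return lam * np.diag(np.diag(Sigma_mle)) + (1.0 - lam) * Sigma_mle
```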

Gaussian Discriminant Analysis Diagonal LDA A diagonal covariance matrix is used with pooled empirical variance

Gaussian Discriminant Analysis Nearest Shrunken Centroids Classifier For feature j: ClassSpecificMean = GlobalMean + ClassSpecificOffset. Ignore features where ClassSpecificOffset is always zero. SRBCT: Small Round Blue Cell Tumor data; 2,308 gene expression values; 4 classes; 63 training examples; 20 testing examples.
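A rough sketch of the shrinkage idea only: soft-threshold the per-feature class offsets toward zero and drop features whose offsets all vanish. The full method also standardizes each offset by a within-class standard deviation, which is omitted here.

```python
import numpy as np

def shrunken_offsets(class_means, global_mean, delta):
    """Soft-threshold class-specific offsets; rows are classes, columns are features."""
    offsets = class_means - global_mean
    shrunk = np.sign(offsets) * np.maximum(np.abs(offsets) - delta, 0.0)
    keep = np.any(shrunk != 0.0, axis=0)   # features with at least one non-zero offset
    return shrunk, keep
```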

Gaussian Discriminant Analysis SRBCT Centroids [Gene Expression Values] Gray: ClassSpecificMean Blue: ClassSpecificOffset

Inference Marginals and Posterior Conditionals Reminder: the variables on this page are vectors and matrices. Example: for the posterior conditional, we may use observed variables to estimate the density for unobserved variables.
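For a partitioned Gaussian x = (x₁, x₂), the standard results (matching the text this lecture follows) are: the marginal is p(x₁) = N(x₁ | μ₁, Σ₁₁), and the posterior conditional is

```latex
p(x_1 \mid x_2) \;=\;
\mathcal{N}\!\left(x_1 \;\middle|\;
\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\;\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)
```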

Inference 2D Example: Marginals and Posterior Conditionals

Inference 2D Example Visualized Centered at (0, 0); correlation coefficient is 0.8. Marginal for x1; posterior for x1 conditioned on x2.
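A minimal numeric sketch of this 2D example, assuming unit variances and an illustrative observed value for x2 (the slide's exact value is not shown):

```python
import numpy as np

mu = np.zeros(2)
rho = 0.8
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

x2 = 1.0  # assumed observed value for x2

# Marginal for x1 is N(0, 1); conditional for x1 given x2:
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2 - mu[1])   # 0.8
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]      # 0.36
print(mu_cond, var_cond)
```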

Inference Inference: Interpolating Noise-Free Data Smoothed estimates based on second-order differencing.

Inference Example: Interpolating Noise-Free Data Larger emphasis on prior; smaller emphasis on likelihood. Noise-free: we believe the sensors are accurate, so the line goes through the observed values. Picture shows interval estimates for x2 given a value for x1 (larger as we move away from observed values).

Inference Inference: Data Imputation Imputing missing values, based on parameter estimates for observed data. Another example where we're making use of a conditional density; i.e. the density for the missing values given the observed values. [20 dimensions!]
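A small sketch of conditional-mean imputation using the partitioned-Gaussian formulas above; mu and Sigma are assumed to have been estimated from the observed data, and `observed` is a boolean mask over dimensions:

```python
import numpy as np

def impute(mu, Sigma, x, observed):
    """Fill missing entries of x with the mean of p(x_missing | x_observed)."""
    m, o = ~observed, observed
    coef = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
    x_filled = x.copy()
    x_filled[m] = mu[m] + coef @ (x[o] - mu[o])
    return x_filled
```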

Linear Gaussian Systems Inference About A Noisy 1D Observation We say that a distribution with fatter tails is less precise. On the left, we see a more precise prior pulls the posterior toward it. On the right, we see a less precise prior has a reduced effect on the posterior.
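The precision-weighting behind this figure, for a scalar observation y ~ N(μ, σ²) and prior μ ~ N(μ₀, τ₀²): precisions add, and the posterior mean is a precision-weighted average, which is why the more precise (thinner-tailed) distribution pulls harder.

```latex
\frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2},
\qquad
\mu_1 = \tau_1^2\left(\frac{\mu_0}{\tau_0^2} + \frac{y}{\sigma^2}\right)
```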

Linear Gaussian Systems Inference About Noisy 2D Observations The prior (in the center) has a mean at (0, 0) and a variance of 0.1 for each dimension. After only 10 observations, the posterior is more closely aligned with the data. The true location is (0.5, 0.5).

Linear Gaussian Systems Sensor Fusion The contours reflect our uncertainty. On the left, we are equally uncertain about the green and red sensors, so our posterior (black) is equidistant. In the middle, we have more uncertainty about the red sensor, so our posterior is closer to the green. On the right, we have more uncertainty about the first dimension for the red sensor and the second dimension for the green.
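A minimal sketch of precision-weighted fusion of two Gaussian sensor readings of the same quantity (with a flat prior, so this is a simplification rather than the lecture's exact worked example); the less uncertain sensor gets more weight:

```python
import numpy as np

def fuse(y1, Sigma1, y2, Sigma2):
    """Combine two noisy readings of the same quantity by adding precisions."""
    P1, P2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    Sigma_post = np.linalg.inv(P1 + P2)
    mu_post = Sigma_post @ (P1 @ y1 + P2 @ y2)
    return mu_post, Sigma_post

# Illustrative readings: the red sensor is noisier, so the fused estimate sits nearer the green one.
green = (np.array([0.0, 0.0]), 0.1 * np.eye(2))
red = (np.array([1.0, 1.0]), 1.0 * np.eye(2))
print(fuse(*green, *red)[0])
```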

Linear Gaussian Systems Interpolation with Noisy Data Again, the stronger prior appears on the left. The line no longer passes through all observations, reflecting our uncertainty about the observations.

The Wishart Distribution Wishart Distribution for the Covariance Matrix The Wishart distribution is a generalization of the gamma distribution. It's used to model uncertainty about the covariance matrix parameter.
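A small sketch using SciPy, which provides both `wishart` and `invwishart`; the inverse-Wishart is the usual conjugate prior for an MVN covariance matrix (the parameter values here are illustrative):

```python
import numpy as np
from scipy.stats import invwishart

S = np.array([[1.0, 0.3],
              [0.3, 2.0]])          # scale matrix (illustrative)
samples = invwishart.rvs(df=5, scale=S, size=3)
print(samples.shape)                # (3, 2, 2): three sampled covariance matrices
```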

Inferring Parameters Covariance Matrix Estimation Recall that the eigenvalues measure variation. On the right, with fewer observations, we see the MAP estimate is better aligned with the true values than the MLE. On the left, with more observations, we see the MLE moving closer to the true values.

Inferring Parameters Sequential Updating of the Posterior for Variance As we accumulate more evidence, our estimate moves closer to the true variance

Inferring Parameters Effect of Prior on Parameter Estimates Based on NIX: Normal Inverse Chi-Squared distribution. We're estimating the mean and variance for a one-dimensional Gaussian. On the left, our priors have less emphasis. In the middle, we're emphasizing our prior for the mean. On the right, we're emphasizing our prior for the variance.

Inferring Parameters Multi-Sensor Fusion Example Plug-in Approximation: Emphasizing the more reliable sensor Exact Posterior: Reflecting uncertainty about which sensor is correct

Natural Language Processing

Natural Language Processing Representation Latent Dirichlet Allocation

Applications: speech recognition, language translation, information retrieval, text classification (e.g. sentiment analysis), anything where human language is an input.

Preprocessing Steps for Text (Options) Convert text to lower-case; remove stop words (e.g. a, an, the); stem tokens (e.g. remove suffixes like "ing"); break up text into tokens (e.g. based on whitespace and punctuation); derive n-grams (e.g. bigrams are pairs of adjacent words); derive inverse document frequency values for the training corpus (idf produces larger weights for terms which appear in fewer documents and smaller weights for terms which appear in more documents); derive term-frequency / inverse-document-frequency (tf-idf) vectors.
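A minimal sketch of several of these options using scikit-learn's TfidfVectorizer (the lecture's own pipeline lives in its notebook; the documents below are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat sat on the mat.", "The dog chased the cat."]

vectorizer = TfidfVectorizer(lowercase=True,        # convert text to lower-case
                             stop_words="english",  # remove stop words
                             ngram_range=(1, 2))    # unigrams and bigrams
X = vectorizer.fit_transform(docs)                  # sparse tf-idf matrix
print(X.shape)
```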

Downloading the Natural Language ToolKit
import nltk
nltk.download()
[click "book", then "Download"]

Text Classification Example A corpus (set) of Reuters news articles is used to produce a text classification model. The vectors are in sparse format, with one line per article. Each line begins with a class label: 1 for positive class observations and -1 for negative class observations. The remaining entries consist of non-zero index:tfidf value pairs for terms found in the news article.
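The label-then-index:value layout described here is the svmlight/libsvm sparse format, which scikit-learn can read directly; the file name below is a placeholder, not the lecture's actual file:

```python
from sklearn.datasets import load_svmlight_file

# Placeholder path; substitute the Reuters file used in class
X, y = load_svmlight_file("reuters_train.svmlight")
print(X.shape, set(y))   # sparse tf-idf matrix and labels in {-1.0, 1.0}
```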

Topic Modeling: Latent Dirichlet Allocation [figure annotations: the number of words for a document; the topic probabilities for a document]

LDA: Probability of Observing Corpus LDA is an unsupervised learning method. Our goal is to learn about groups of related terms. http://jmlr.org/papers/v3/blei03a.html The likelihood of the corpus can be estimated using our model, as shown below.
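The formula itself appears only as a figure in the slides; the standard corpus likelihood from the cited Blei et al. (2003) paper is:

```latex
p(\mathcal{D} \mid \alpha, \beta) \;=\;
\prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
\left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```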

Pseudocode for LDA Training With Gibbs Sampling
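The pseudocode itself is shown as a figure in the slides; below is a compact sketch of collapsed Gibbs sampling for LDA under the usual symmetric Dirichlet priors (hyperparameter values and variable names are mine, not the lecture's):

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """docs: list of documents, each a list of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))          # topic counts per document
    n_kw = np.zeros((K, V))                  # word counts per topic
    n_k = np.zeros(K)                        # total word counts per topic
    z = [[0] * len(doc) for doc in docs]     # topic assignment for every word

    # Random initialization of topic assignments
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = rng.integers(K)
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this word's current assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Full conditional over topics, then resample
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw
```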

Installing LDA pip install lda --user See the Python notebook for the examples covered in class
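A small usage sketch, assuming the `lda` package follows the interface I recall (LDA(n_topics, n_iter) fit on a document-term count matrix); see the class notebook for the real examples:

```python
import numpy as np
import lda

# Random document-term counts, purely for illustration
X = np.random.randint(0, 5, size=(20, 100))

model = lda.LDA(n_topics=3, n_iter=200, random_state=0)
model.fit(X)
print(model.topic_word_.shape)   # (n_topics, vocabulary_size)
print(model.doc_topic_.shape)    # (n_documents, n_topics)
```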