Variable selection and feature construction using methods related to information theory

Variable selection and feature construction using methods related to information theory
Kari, Intelligent Systems Lab, Motorola, Tempe, AZ. IJCNN 2007.

Outline
1 Information Theory and Communication Channels in Practice
2 Mutual Information and classification problems: Class Separability Measures, The Bayes Error, MI in Variable Selection
3 Feature transforms based on maximizing Mutual Information
4 Learning Distance Metrics; Information Bottleneck
5 Summary

Why Information Theory?
Variables or features can be understood as a noisy channel that conveys information about the message. The aim is to select or to construct features that provide as much information as possible about the message. Using information theory, variable selection and feature construction can be viewed as coding and distortion problems. Read Shannon!

Entropy
Consider a continuous random variable X ∈ R^d representing the available variables or observations, and a discrete-valued random variable Y representing the class labels. The uncertainty, or entropy, in drawing one sample of Y at random is, according to Shannon's definition,
H(Y) = E_y[\log_2(1/p(y))] = -\sum_y p(y) \log_2 p(y).  (1)
(Differential) entropy can also be written for a continuous variable as
H(X) = E_x[\log_2(1/p(x))] = -\int_x p(x) \log_2 p(x) \, dx.  (2)

Conditional Entropy and Mutual Information
After having made an observation of a variable vector x, the uncertainty of the class identity is defined in terms of the conditional density p(y|x):
H(Y|X) = -\int_x p(x) \left( \sum_y p(y|x) \log_2 p(y|x) \right) dx.  (3)
The reduction in class uncertainty after having observed the variable vector x is called the mutual information between X and Y:
I(Y, X) = H(Y) - H(Y|X)  (4)
        = \sum_y \int_x p(y, x) \log_2 \frac{p(y, x)}{p(y) p(x)} \, dx.  (5)
This is the same as the Kullback-Leibler divergence between the joint density p(y, x) and its factored form p(y)p(x).

Illustration of entropies
H(X) and H(Y) are each represented by a circle. The joint entropy H(X, Y) is the union of the circles, and the mutual information I(X; Y) is their intersection:
H(X, Y) = H(X) + H(Y) - I(X; Y).
[Venn diagram with regions labeled H(X|Y), I(X; Y), and H(Y|X).]
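As a sanity check of these identities, here is a minimal numerical example; the small discrete joint distribution is made up purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over a discrete X (rows) and Y (columns).
p_xy = np.array([[0.20, 0.05],
                 [0.10, 0.25],
                 [0.05, 0.35]])

def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries contribute 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

H_x, H_y = entropy(p_x), entropy(p_y)
H_xy = entropy(p_xy.ravel())    # joint entropy H(X, Y)

# Mutual information from the KL form, Eq. (5).
I_xy = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
           for i in range(p_xy.shape[0])
           for j in range(p_xy.shape[1]) if p_xy[i, j] > 0)

# Check I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y|X).
H_y_given_x = H_xy - H_x
print(f"I(X;Y)            = {I_xy:.4f} bits")
print(f"H(X)+H(Y)-H(X,Y)  = {H_x + H_y - H_xy:.4f} bits")
print(f"H(Y)-H(Y|X)       = {H_y - H_y_given_x:.4f} bits")
```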

Channel Coding
Shannon: a channel with input X and output Y. The rate of transmission of information is R = H(X) - H(X|Y) = I(X, Y). The capacity of this particular (fixed) channel is defined as the maximum rate over all possible input distributions, C = max_{p(x)} R. Maximizing the rate means choosing an input distribution that matches the channel (under some constraints, such as fixed power or efficiency of the channel).

Analogy to variable selection and feature construction
The real source Y is now represented (encoded) as the available variables X, so the channel input distribution is fixed. We instead modify how the input is communicated to the receiver by the channel, either by selecting a subset of the available variables or by constructing new features Φ = g(X, θ), where g denotes some selection or construction function and θ represents some tunable parameters. In Shannon's case θ was fixed but X was subject to change; here the channel capacity can be written as C = max_θ R, subject to some constraints, such as keeping the dimensionality of the new feature representation a small constant.

Rate-distortion theorem
Find the simplest representation (in terms of bits/sec) of a continuous source signal within a given tolerable upper limit of distortion, so that channel capacity is not wasted. The solution for a given distortion D is the representation Φ that minimizes the rate, R(D) = min_{E(d) ≤ D} I(X, Φ).
Combining the two results gives a loss function
L(p(φ|x)) = I(X, Φ) - β I(Φ, Y)  (6)
that does not require setting constraints on the dimensionality of the representation; rather, the dimensionality emerges as part of the solution. The representation Φ can be seen as a bottleneck that extracts relevant information about Y from X (Tishby, 1999).

Estimating Mutual Information
Between two variables one can use a non-parametric histogram approach (Battiti, 1994), but in higher dimensions any amount of data is too sparse to bin.
Alternatively, fit parametric class density estimates (such as Gaussians) and plug them into the definition of MI.
MI is a difference between two entropies, so the problem reduces to entropy estimation. The simplest approach is the maximum likelihood estimate based on histograms. It is known to have a negative bias, which can be corrected to some extent by the so-called Miller-Madow bias correction: add (m̂ - 1)/(2N) to the estimate, where m̂ denotes an estimate of the number of bins with nonzero probability. This cannot be done in many practical cases, such as when the number of bins is close to the number of observations (Paninski, 2003).
Bayesian techniques can be used if some information about the underlying probability density function is available in the form of a prior (Wolpert & Wolf, 1995; Zaffalon, 2002).
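A minimal sketch of the histogram (plug-in) entropy estimate with the Miller-Madow correction described above. The bin count, the sample data, and the nats-to-bits conversion of the correction are illustrative assumptions.

```python
import numpy as np

def entropy_mle_bits(samples, bins=10):
    """Plug-in (maximum likelihood) entropy estimate from a histogram, in bits."""
    counts, _ = np.histogram(samples, bins=bins)
    n = counts.sum()
    p = counts[counts > 0] / n
    return -np.sum(p * np.log2(p)), counts

def entropy_miller_madow_bits(samples, bins=10):
    """Plug-in estimate plus the Miller-Madow correction (m_hat - 1)/(2N).
    The correction is usually stated in nats, so it is divided by log(2) here."""
    h_mle, counts = entropy_mle_bits(samples, bins)
    n = counts.sum()
    m_hat = np.count_nonzero(counts)       # estimated number of occupied bins
    return h_mle + (m_hat - 1) / (2 * n * np.log(2))

rng = np.random.default_rng(0)
x = rng.normal(size=200)                   # smallish sample, so the bias is visible
print("plug-in estimate:       ", entropy_mle_bits(x)[0])
print("Miller-Madow corrected: ", entropy_miller_madow_bits(x))
```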

Measures other than Shannon's
Shannon derived the entropy measure axiomatically and showed that no other measure fulfills all the axioms. However, if the goal is to find a distribution that minimizes or maximizes an entropy or divergence, the axioms used in deriving the measure can be relaxed and the result of the optimization is still the same distribution (Kapur, 1994).
One example is the Renyi entropy, defined for a discrete variable Y and for a continuous variable X as
H_α(Y) = \frac{1}{1-α} \log_2 \sum_y p(y)^α;  H_α(X) = \frac{1}{1-α} \log_2 \int_x p(x)^α \, dx,  (7)
where α > 0, α ≠ 1, and lim_{α→1} H_α = H.
The quadratic Renyi entropy (α = 2) is straightforward to estimate from a set of samples using the Parzen window approach.

Non-Parametric Estimation of Renyi Entropy
Make use of the fact that the convolution of two Gaussians is a Gaussian, that is,
\int_y G(y - a_i, Σ_1) G(y - a_j, Σ_2) \, dy = G(a_i - a_j, Σ_1 + Σ_2).  (8)
The quadratic Renyi entropy then reduces to samplewise interactions when combined with Parzen density estimation (Principe, Fisher, and Xu, 2000):
H_R(Y) = -\log \int_y p(y)^2 \, dy
       = -\log \int_y \left( \frac{1}{N} \sum_{k=1}^{N} G(y - y_k, σ^2 I) \right) \left( \frac{1}{N} \sum_{j=1}^{N} G(y - y_j, σ^2 I) \right) dy
       = -\log \frac{1}{N^2} \sum_{k=1}^{N} \sum_{j=1}^{N} G(y_k - y_j, 2σ^2 I).  (9)
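A minimal sketch of Eq. (9): the quadratic Renyi entropy of a sample estimated directly from pairwise Gaussian interactions. The kernel width σ and the synthetic data are assumed for illustration, and the natural log is used rather than bits.

```python
import numpy as np

def renyi_quadratic_entropy(Y, sigma):
    """H_R(Y) = -log( (1/N^2) * sum_{k,j} G(y_k - y_j, 2*sigma^2*I) ), Eq. (9)."""
    Y = np.atleast_2d(Y)
    N, d = Y.shape
    diffs = Y[:, None, :] - Y[None, :, :]            # all pairwise differences
    sq = np.sum(diffs ** 2, axis=-1)                 # squared pairwise distances
    var = 2.0 * sigma ** 2                           # covariance 2*sigma^2*I
    G = np.exp(-sq / (2.0 * var)) / ((2.0 * np.pi * var) ** (d / 2))
    return -np.log(G.mean())                         # mean = (1/N^2) * double sum

rng = np.random.default_rng(1)
Y = rng.normal(size=(300, 2))
print(renyi_quadratic_entropy(Y, sigma=0.5))
```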

Divergence Measures
Kullback-Leibler divergence:
K(f, g) = \int_x f(x) \log \frac{f(x)}{g(x)} \, dx.  (10)
Variational distance (a member of the f-divergence family):
V(f, g) = \int_x |f(x) - g(x)| \, dx.  (11)
Quadratic divergence:
D(f, g) = \int_x (f(x) - g(x))^2 \, dx.  (12)
Pinsker's inequality gives a lower bound, K(f, g) ≥ \frac{1}{2} V(f, g)^2. When f(x) and g(x) take values between zero and one (as probability mass functions do), |f(x) - g(x)| ≥ (f(x) - g(x))^2, and thus V(f, g) ≥ D(f, g). Maximizing D(f, g) therefore maximizes a lower bound on K(f, g).
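A quick numerical comparison of the three divergences on a pair of discrete distributions (the values are arbitrary), illustrating Pinsker's inequality and V ≥ D.

```python
import numpy as np

f = np.array([0.5, 0.3, 0.2])
g = np.array([0.2, 0.5, 0.3])

kl = np.sum(f * np.log(f / g))       # Kullback-Leibler divergence K(f, g), Eq. (10)
v = np.sum(np.abs(f - g))            # variational distance V(f, g), Eq. (11)
d = np.sum((f - g) ** 2)             # quadratic divergence D(f, g), Eq. (12)

print(f"K(f,g) = {kl:.4f}  >=  V^2/2 = {0.5 * v**2:.4f}   (Pinsker)")
print(f"V(f,g) = {v:.4f}  >=  D(f,g) = {d:.4f}")
```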

Class Separability Measures
1 Sums of distances between data points of different classes.
2 Nonlinear functions of the distances, or of sums of the distances.
3 Probabilistic measures based on class-conditional densities. These may approximate the class-conditional densities and then apply some distance measure between the densities (Bhattacharyya distance or divergence). A Gaussian assumption usually needs to be made about the class-conditional densities to make numerical optimization tractable. The equal-class-covariance assumption, although restrictive, leads to the well-known Linear Discriminant Analysis (LDA), which has an analytic solution. Some measures allow non-parametric estimation of the class-conditional densities.
4 The Bayes error.

Relation of the Bayes Error to Mutual Information
The Bayes risk using 0/1-loss for classification can be written as the Bayes error:
e_{bayes}(X) = E_x[\Pr(y ≠ ŷ)] = \int_x p(x) \left( 1 - \max_i p(y_i|x) \right) dx.  (13)
An upper bound on the Bayes error (Hellman, 1970; Feder, 1990):
e_{bayes}(X) ≤ \frac{1}{2} H(Y|X) = \frac{1}{2} (H(Y) - I(Y, X)).  (14)
A lower bound on the error, also involving the conditional entropy or mutual information, is given by Fano's (1961) inequality:
e_{bayes}(X) ≥ 1 - \frac{I(Y, X) + \log 2}{\log |Y|},  (15)
where |Y| refers to the cardinality of Y. Both bounds are minimized when the mutual information between Y and X is maximized, or equivalently when H(Y|X) is minimized. The bounds are relatively tight, in the sense that both inequalities can be attained with equality.
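A small numerical check of the two bounds on a toy discrete joint distribution (values chosen arbitrarily); entropies are in nats to match the log 2 term in the Fano-style bound as written above.

```python
import numpy as np

# p(x, y): rows index the observation x, columns index the class y.
p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.05, 0.05, 0.30]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
p_y_given_x = p_xy / p_x[:, None]

# Bayes error: pick the most probable class for each x, Eq. (13).
bayes_err = np.sum(p_x * (1.0 - p_y_given_x.max(axis=1)))

H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))        # entropy in nats
H_y = H(p_y)
H_y_given_x = np.sum(p_x * np.array([H(row) for row in p_y_given_x]))
I_yx = H_y - H_y_given_x

upper = 0.5 * H_y_given_x                                  # Eq. (14)
lower = 1.0 - (I_yx + np.log(2)) / np.log(p_y.size)        # Eq. (15)
print(f"{lower:.3f} <= Bayes error = {bayes_err:.3f} <= {upper:.3f}")
```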

Pairwise MI in variable selection: MIFS (Battiti, 1994)
1: Set X̂ = argmax_{X_i} I(Y, X_i); set Φ ← {X̂}; set F ← {X_1, ..., X_N} \ {X̂}.
2: For all pairs (i, j) with X_i ∈ F and X_j ∈ Φ, evaluate and save I(X_i, X_j), unless already saved.
3: Set X̂ = argmax_{X_i} [ I(Y, X_i) - β Σ_{X_j ∈ Φ} I(X_i, X_j) ]; set Φ ← Φ ∪ {X̂}; set F ← F \ {X̂}; repeat from step 2 until Φ reaches the desired size.
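A sketch of the greedy MIFS loop above. The plug-in mutual_info helper, the toy data, and the value of β are illustrative assumptions rather than part of the original slides.

```python
import numpy as np

def mutual_info(a, b):
    """Plug-in MI estimate (nats) between two discrete-valued 1-D integer arrays."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for ai, bi in zip(a, b):
        joint[ai, bi] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))

def mifs(X, y, k, beta=0.5):
    """Greedy MIFS: repeatedly add the feature maximizing
    I(Y, X_i) - beta * sum_{X_j in selected} I(X_i, X_j)."""
    n_features = X.shape[1]
    relevance = [mutual_info(X[:, i], y) for i in range(n_features)]
    selected, remaining = [], list(range(n_features))
    redundancy = {}                                # cache of saved I(X_i, X_j), step 2
    while len(selected) < k and remaining:
        def score(i):
            s = relevance[i]
            for j in selected:
                if (i, j) not in redundancy:
                    redundancy[(i, j)] = mutual_info(X[:, i], X[:, j])
                s -= beta * redundancy[(i, j)]
            return s
        best = max(remaining, key=score)           # step 3
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: the label is built from two independent bits; feature 1 duplicates feature 0.
rng = np.random.default_rng(0)
b0 = rng.integers(0, 2, size=500)
b1 = rng.integers(0, 2, size=500)
y = 2 * b1 + b0
X = np.column_stack([b0, b0, b1])
print(sorted(mifs(X, y, k=2)))    # expected: [0, 2] -- the redundant duplicate is skipped
```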

Learning by Maximizing Mutual Information Between Class Labels and Features
[Diagram: class labels c and high-dimensional data x; a transform g(w, x) produces low-dimensional features y; the mutual information I(c, y), the "information potential", and its gradient ∂I/∂w drive the learning.]
Express I = I({y_i, c_i}) in a differentiable form and perform gradient ascent (or another optimization) on w, the parameters of the transform g:
w_{t+1} = w_t + η \frac{∂I}{∂w} = w_t + η \sum_{i=1}^{N} \frac{∂I}{∂y_i} \frac{∂y_i}{∂w}.
The first factor in the last term is the information force that the other samples exert on y_i; the second factor depends on the transform. If y_i = W x_i, then simply ∂y_i/∂W = x_i^T.

Non-Parametric MI between Features and Labels
The labels form a discrete random variable C; the features a continuous, vector-valued Y. Write the mutual information I_T between C and Y using the quadratic divergence:
I_T(C, Y) = \sum_c \int_y (p(c, y) - p(c) p(y))^2 \, dy
          = \sum_c \int_y p(c, y)^2 \, dy + \sum_c \int_y p(c)^2 p(y)^2 \, dy - 2 \sum_c \int_y p(c, y) p(c) p(y) \, dy.  (16)

Non-Parametric MI between Features and Labels
Using a data set of N samples and expressing the class densities by their Parzen estimates with kernel width σ gives
I_T({y_i, c_i}) = V_{IN} + V_{ALL} - 2 V_{BTW}
 = \frac{1}{N^2} \sum_{p=1}^{N_c} \sum_{k=1}^{J_p} \sum_{l=1}^{J_p} G(y_{pk} - y_{pl}, 2σ^2 I)
 + \frac{1}{N^2} \sum_{p=1}^{N_c} \left( \frac{J_p}{N} \right)^2 \sum_{k=1}^{N} \sum_{l=1}^{N} G(y_k - y_l, 2σ^2 I)
 - \frac{2}{N^2} \sum_{p=1}^{N_c} \frac{J_p}{N} \sum_{j=1}^{J_p} \sum_{k=1}^{N} G(y_{pj} - y_k, 2σ^2 I),  (17)
where N_c is the number of classes and J_p the number of samples in class p.
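A compact sketch of Eq. (17). The kernel width σ, the class handling, and the two-cluster synthetic data are assumptions made for illustration.

```python
import numpy as np

def gauss(diff, var, d):
    """Isotropic Gaussian G(diff, var*I) evaluated at an array of difference vectors."""
    sq = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq / (2.0 * var)) / ((2.0 * np.pi * var) ** (d / 2))

def quadratic_mi(Y, c, sigma):
    """I_T({y_i, c_i}) = V_IN + V_ALL - 2*V_BTW with Parzen estimates, Eq. (17)."""
    Y = np.atleast_2d(Y)
    N, d = Y.shape
    var = 2.0 * sigma ** 2
    G_all = gauss(Y[:, None, :] - Y[None, :, :], var, d)   # all pairwise interactions
    classes, counts = np.unique(c, return_counts=True)

    v_in = sum(G_all[np.ix_(c == cl, c == cl)].sum() for cl in classes) / N ** 2
    v_all = np.sum((counts / N) ** 2) * G_all.sum() / N ** 2
    v_btw = sum((J / N) * G_all[c == cl, :].sum()
                for cl, J in zip(classes, counts)) / N ** 2
    return v_in + v_all - 2.0 * v_btw

rng = np.random.default_rng(2)
Y = np.vstack([rng.normal(-1, 0.5, size=(50, 2)), rng.normal(+1, 0.5, size=(50, 2))])
c = np.array([0] * 50 + [1] * 50)
print(quadratic_mi(Y, c, sigma=0.5))
```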

Gradient of the Information Potential
First we need the derivative of the kernel, i.e. the force between two samples:
\frac{∂}{∂y_i} G(y_i - y_j, 2σ^2 I) = G(y_i - y_j, 2σ^2 I) \frac{y_j - y_i}{2σ^2}.  (18)
With this we get for V_{IN}
\frac{∂V_{IN}}{∂y_{ci}} = \frac{1}{N^2 σ^2} \sum_{k=1}^{J_c} G(y_{ck} - y_{ci}, 2σ^2 I)(y_{ck} - y_{ci}).  (19)
This is a sum of forces that the other particles in class c exert on particle y_{ci}, attracting it towards its classmates. ∂V_{ALL}/∂y_i is a sum of forces that all other particles, regardless of class, exert on y_i. The effect of ∂V_{BTW}/∂y_{ci} is directed away from the other classes; it represents the repulsion of the classes away from each other.
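A small sketch of the within-class force of Eq. (19) acting on one sample; the synthetic class data, the choice of N, and σ are illustrative assumptions.

```python
import numpy as np

def v_in_force(Y_class, i, N, sigma):
    """dV_IN/dy_ci, Eq. (19): the attractive force that the other samples of the same
    class exert on sample i, pulling it towards its classmates."""
    var = 2.0 * sigma ** 2
    d = Y_class.shape[1]
    diff = Y_class - Y_class[i]                       # y_ck - y_ci for all k
    sq = np.sum(diff ** 2, axis=1)
    G = np.exp(-sq / (2.0 * var)) / ((2.0 * np.pi * var) ** (d / 2))
    return (G[:, None] * diff).sum(axis=0) / (N ** 2 * sigma ** 2)

rng = np.random.default_rng(3)
Y_class = rng.normal(size=(20, 2))     # output-space samples of one class
print(v_in_force(Y_class, i=0, N=100, sigma=0.5))
```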

Information Potential and Information Forces
The mutual information I_T({y_i, c_i}) can thus be interpreted as an information potential induced by the samples of data in different classes, and ∂I/∂y_i as an information force that the other samples exert on sample y_i. It has three components:
1 Samples within a class attract each other.
2 All samples attract each other.
3 Samples of different classes repel each other.
[Figure: two-dimensional scatter of samples illustrating the information forces.]
Computing ∂I/∂y_i for all y_i requires O(N^2) operations (2003).

Effect of the kernel width on the forces
Three classes in three dimensions projected onto a two-dimensional subspace.
[Two scatter plots: forces with a wide kernel (left) and with a narrow kernel (right).]

LDA vs. MMI
Landsat satellite image database from the UCI repository: six classes in 36 dimensions projected onto a two-dimensional subspace.
LDA: a compact representation, but the classes lie on top of each other. MMI: the classes are not as compact (they need not look Gaussian) but are better separated.
[Two scatter plots: LDA projection (left) and MMI projection (right).]

PCA vs. LDA vs. MMI
Three classes in 12 dimensions (oil pipe-flow data from Aston University) projected onto a two-dimensional subspace using PCA (left), LDA or MMI with a wide kernel (middle), and MMI with a narrow kernel (right).
[Three scatter plots: PCA, LDA/MMI wide, MMI narrow.]

Learning nonlinear transforms using MI
[Diagram: class labels c and high-dimensional data x; a transform g(w, x) produces low-dimensional features y; the mutual information I(c, y) (the information potential) and its gradient ∂I/∂w drive the learning.]
Exactly the same procedure as with linear transforms. The transform g just needs to be continuous and differentiable with respect to the parameter vector w. The information force remains the same; only the transform-dependent part ∂y_i/∂w differs:
w_{t+1} = w_t + η \frac{∂I}{∂w} = w_t + η \sum_{i=1}^{N} \frac{∂I}{∂y_i} \frac{∂y_i}{∂w}.

Nonlinear transforms
Multilayer perceptrons: the gradient of the output with respect to the weights is obtained by (information) backpropagation; hidden-layer activation is tanh; the output-layer activation is either linear (which needs orthonormalized weights in the output layer) or tanh (which allows a data-independent kernel width); weights are initialized partially by LDA.
Radial basis function networks: basis functions are fitted by EM separately for each class as mixtures of diagonal-covariance Gaussians. Two options: use MMI only to learn the output layer (this work), or learn all the parameters using MMI.

Stochastic gradient
Instead of computing the interactions between all pairs of data points, take a random sample of just two. If the two samples are of the same class: no update. If they are in different classes:
W_{t+1} = W_t + η \sum_{i=1,2} \frac{∂I}{∂y_i} \frac{∂y_i}{∂W} = W_t - \frac{η}{8σ^2} G(y_1 - y_2, 2σ^2 I)(y_2 - y_1)(x_1^T - x_2^T).  (20)
The full gradient using all pairs and the stochastic gradient using just one pair at a time are two ends of a spectrum: it is preferable to take as large a random sample of the whole data set as possible and to compute all the mutual interactions between those samples for one update of W.
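A sketch of the two-sample stochastic update of Eq. (20) for a linear transform y = Wx. The learning rate, kernel width, dimensions, and data are assumed toy values.

```python
import numpy as np

def stochastic_mmi_update(W, x1, x2, c1, c2, eta=0.1, sigma=0.5):
    """One stochastic MMI step, Eq. (20): no update if the samples share a class,
    otherwise push their low-dimensional projections apart."""
    if c1 == c2:
        return W
    y1, y2 = W @ x1, W @ x2
    var = 2.0 * sigma ** 2
    d = y1.size
    diff = y1 - y2
    G = np.exp(-(diff @ diff) / (2.0 * var)) / ((2.0 * np.pi * var) ** (d / 2))
    grad = G * np.outer(y2 - y1, x1 - x2)          # (y2 - y1)(x1^T - x2^T)
    return W - (eta / (8.0 * sigma ** 2)) * grad

rng = np.random.default_rng(4)
W = rng.normal(size=(2, 5))                         # project 5-D inputs to 2-D features
x1, x2 = rng.normal(size=5), rng.normal(size=5)
W = stochastic_mmi_update(W, x1, x2, c1=0, c2=1)    # samples from different classes
```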

Semi-Parametric Density Estimation
Construct a Gaussian mixture model (GMM) in the low-dimensional output space after a random or an informed guess at the transform:
p(y|c_p) = \sum_{j=1}^{K_p} h_{pj} G(y - m_{pj}, S_{pj}).  (21)
The same samples are used to construct a GMM in the input space, using exactly the same assignments of samples to mixture components as the output-space GMMs. Running the EM algorithm in the input space is then unnecessary, since we already know which samples belong to which mixture components. We now have GMMs in both spaces and a transform mapping between the two, and we avoid operating in the high-dimensional input space altogether.

Video clips

Fisher Information
Options to make the relevant information more explicit:
1 Variable selection.
2 Feature construction. The selection/construction matrix S defines a global Euclidean metric
d_A^2(x, x') = (x - x')^T S^T S (x - x') = (x - x')^T A (x - x').
3 Learning a distance metric that is locally relevant to the target (or to some auxiliary variable): A = A(x),
d_A^2(x, x + dx) = dx^T A(x) \, dx.
The metric should reflect the divergence between the conditional distributions of the target:
d_J^2(x, x + dx) = D_{KL}[p(y|x) \,\|\, p(y|x + dx)] = \frac{1}{2} dx^T J(x) \, dx,
where J(x) is the Fisher information matrix. Embedded into e.g. a clustering algorithm, this yields semi-supervised clustering that reflects the auxiliary variable (Peltonen, 2004).
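As a concrete illustration of the local metric d_J^2, here is a minimal sketch for a binary logistic model, an assumption chosen because its Fisher information matrix has the simple closed form p(1-p) w w^T; the weights and points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_matrix(x, w):
    """Fisher information J(x) of p(y=1|x) = sigmoid(w.x); equals p(1-p) w w^T."""
    p = sigmoid(w @ x)
    return p * (1.0 - p) * np.outer(w, w)

def local_distance_sq(x, dx, w):
    """d_J^2(x, x + dx) = 0.5 * dx^T J(x) dx."""
    return 0.5 * dx @ fisher_matrix(x, w) @ dx

w = np.array([2.0, 0.0])                  # only the first coordinate affects the target
x = np.array([0.3, -0.1])
print(local_distance_sq(x, np.array([0.1, 0.0]), w))   # move along the relevant direction
print(local_distance_sq(x, np.array([0.0, 0.1]), w))   # irrelevant direction: distance 0
```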

Information Bottleneck
If X is the original data, Φ a sought representation, and Y the variable(s) of importance, minimizing the loss function
L(p(φ|x)) = I(X, Φ) - β I(Φ, Y)
leads to the IB solution, a set of self-consistent equations:
1 p(φ|x) = \frac{p(φ)}{Z(β, x)} \exp(-β D_{KL}[p(y|x) \,\|\, p(y|φ)])
2 p(y|φ) = \frac{1}{p(φ)} \sum_x p(y|x) p(φ|x) p(x)
3 p(φ) = \sum_x p(φ|x) p(x)
For example, if X are documents, Y are words, and Φ are document clusters, the probability of cluster membership decays exponentially with the KL divergence between the word distributions in document x and in cluster φ (Tishby, 1999).
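A minimal sketch iterating the three self-consistent equations above for discrete X, Y, and Φ; the toy joint p(x, y), the number of clusters, β, and the iteration count are all illustrative assumptions.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterate the self-consistent IB updates for p(phi|x), p(y|phi), and p(phi)."""
    rng = np.random.default_rng(seed)
    n_x = p_xy.shape[0]
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]
    p_phi_given_x = rng.dirichlet(np.ones(n_clusters), size=n_x)   # random soft init
    for _ in range(n_iter):
        p_phi = p_phi_given_x.T @ p_x                              # equation 3
        # Equation 2: p(y|phi) = (1/p(phi)) sum_x p(y|x) p(phi|x) p(x)
        p_y_given_phi = (p_phi_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_phi /= p_phi[:, None]
        # Equation 1: p(phi|x) proportional to p(phi) exp(-beta KL[p(y|x) || p(y|phi)])
        logits = np.log(p_phi + 1e-12)[None, :] - beta * np.array(
            [[kl(p_y_given_x[i], p_y_given_phi[t]) for t in range(n_clusters)]
             for i in range(n_x)])
        p_phi_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_phi_given_x /= p_phi_given_x.sum(axis=1, keepdims=True)  # normalization Z
    return p_phi_given_x

p_xy = np.array([[0.15, 0.05, 0.00],      # toy document-word co-occurrence probabilities
                 [0.10, 0.10, 0.00],
                 [0.00, 0.05, 0.25],
                 [0.00, 0.10, 0.20]])
print(information_bottleneck(p_xy, n_clusters=2, beta=5.0).round(2))
```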

Summary
Shannon's seminal work showed how mutual information provides a measure of the maximum rate of transmission of information through a channel. There is an analogy to variable selection and feature construction, with mutual information as the criterion for providing maximal information about the variable of interest. Is this the best thing since sliced bread for variable selection and feature construction? Maybe not: estimation problems remain with high-dimensional, small(ish) data sets.