Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees
Radford M. Neal and Jianguo Zhang
Presented by Jiwen Li, Feb 2, 2006

Outline
- Bayesian view of feature selection
- The approach used in the paper
- Univariate tests and PCA
- Bayesian neural networks
- Dirichlet diffusion trees
- NIPS 2003 experiment results
- Conclusion

Feature selection, why?
- Improve learning accuracy. Too many features cause overfitting for maximum likelihood approaches, but may not for Bayesian methods.
- Reduce the computational complexity. This is especially a problem for Bayesian methods, but dimensionality can also be reduced by other means such as PCA.
- Reduce the cost of measuring features in the future, making an optimal tradeoff that balances the cost of feature measurements against the cost of prediction errors.

The Bayesian Approach
- Fit the Bayesian model, encoding your beliefs in the prior P(θ):
  P(θ | X_train, Y_train) = P(θ) P(Y_train | X_train, θ) / ∫ P(θ) P(Y_train | X_train, θ) dθ
  Note: the model can be complex, and it uses all the features.
- Make predictions with the Bayesian model on the test cases:
  P(Y_new | X_new, X_train, Y_train) = ∫ P(Y_new | X_new, θ) P(θ | X_train, Y_train) dθ
  Predictions are found by integrating over the parameter space of the model.
- Find the best subset of features based on the posterior distribution of the model parameters and on the cost of the features used.
  Note: knowing the cost of measuring each feature is essential for making the right trade-off.
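A minimal sketch of the predictive integral above, approximated by Monte Carlo averaging over posterior samples. The logistic_likelihood model and the random theta_samples are purely illustrative stand-ins; in practice the samples would come from an MCMC run over the posterior.

```python
import numpy as np

def predictive_prob(x_new, theta_samples, likelihood):
    """Monte Carlo estimate of P(y_new = 1 | x_new, training data):
    average P(y_new = 1 | x_new, theta) over posterior samples of theta."""
    return float(np.mean([likelihood(x_new, theta) for theta in theta_samples]))

def logistic_likelihood(x, theta):
    # Hypothetical model: theta = (alpha, beta), logistic regression probability.
    alpha, beta = theta
    return 1.0 / (1.0 + np.exp(-(alpha + beta @ x)))

rng = np.random.default_rng(0)
# Placeholder "posterior" samples; a real application would draw these by MCMC.
theta_samples = [(rng.normal(), rng.normal(size=5)) for _ in range(100)]
x_new = rng.normal(size=5)
print(predictive_prob(x_new, theta_samples, logistic_likelihood))
```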

Using feature construction instead of feature selection (1/3)
- Use a learning method that is invariant to rotations in the input space and which ignores inputs that are always zero.
- Rotate the training cases so that only m inputs are non-zero for the training cases, then drop all the always-zero inputs but one.
- Rotate the test cases accordingly, setting the one retained extra input to the distance of the test case from the space of the training cases.

Using feature construction instead of feature selection (2/3)
Use a learning method that is invariant to rotations in the input space and which ignores inputs that are always zero.
Example: Bayesian logistic regression with a spherically symmetric prior,
  P(Y_i = 1 | X_i = x_i) = [1 + exp(−(α + Σ_{j=1}^n β_j x_ij))]^{−1}   (1)
where β has a multivariate Gaussian prior with zero mean and covariance proportional to the identity.
Given any orthogonal matrix R, apply the linear transforms X'_i = R X_i and β' = R β. Since β'^T X'_i = β^T R^T R X_i = β^T X_i, the rotation has no effect on the probabilities in (1), and the spherically symmetric prior on β is likewise unchanged.
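A small numerical check of the rotation-invariance argument, using a random orthogonal R obtained from a QR decomposition (an assumption for illustration); it verifies that transforming both the input and the coefficients leaves the probability in (1) unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10                      # number of features
x = rng.normal(size=n)      # one input vector
beta = rng.normal(size=n)   # coefficients (a draw from a spherical Gaussian)
alpha = 0.3

# Random orthogonal matrix: Q factor of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(n, n)))

x_rot = R @ x               # rotated input
beta_rot = R @ beta         # correspondingly rotated coefficients

p_original = 1 / (1 + np.exp(-(alpha + beta @ x)))
p_rotated = 1 / (1 + np.exp(-(alpha + beta_rot @ x_rot)))
print(np.isclose(p_original, p_rotated))   # True: probability in (1) is unchanged
```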

Using feature construction instead of feature selection (3/3)
Rotate the training cases so that only m inputs are non-zero for the training cases, then drop all the always-zero inputs but one.
Example: Bayesian logistic regression with a spherically symmetric prior.
There always exists an orthogonal transformation R for which all but m of the components of the transformed features R X_i are zero in all m training cases. PCA is an approximate way of finding this transformation: it projects X_i onto the m principal components found from the training cases, and projects the portion of X_i normal to the space of these principal components onto some set of (n − m) additional orthogonal directions. For the training cases, the projections in these (n − m) other directions will all be approximately zero, so X'_ij will be approximately zero for j > m. Clearly, one then need only compute the first m terms of the sum in (1).
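A sketch of this point, under the assumption that the rotation is taken from the SVD of the (uncentered) training matrix rather than from PCA proper: the training cases have essentially zero coordinates beyond the first m rotated directions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 500             # m training cases, n features, n >> m
X = rng.normal(size=(m, n))

# Full set of orthogonal directions; the first m right singular vectors
# span the space of the training cases.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
X_rot = X @ Vt.T           # coordinates of the training cases in the rotated basis

# Components beyond the first m are numerically zero for every training case,
# so only the first m terms of the sum in (1) matter.
print(np.abs(X_rot[:, m:]).max())   # ~1e-13
```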

The approach used in the paper
1. Reduce the number of features used for classification to no more than a few hundred, either by selecting a subset of features with simple univariate significance tests, or by performing a global dimensionality reduction with PCA on all training, validation and test cases.
2. Apply a neural network classifier based on Bayesian learning, using an ARD prior that allows the model to determine which of these features are most relevant.
3. If a smaller number of features is desired, use the relevance hyperparameters from the Bayesian neural network to pick a smaller subset.
4. Apply Dirichlet diffusion trees (a Bayesian hierarchical clustering method) as the classification method, again using an ARD prior that allows the model to determine which of these features are most relevant.

Feature selection using univariate tests
An initial feature subset was found using simple univariate significance tests.
Assumption: relevant variables will be at least somewhat relevant to the target on their own.
Three significance tests were used:
- Pearson correlation
- Spearman correlation
- A runs test
A p-value is calculated for each by a permutation test.

Spearman correlation
Definition: an extension of the linear correlation to cases where X and Y are measured on a merely ordinal scale.
Preprocessing: transform the feature values to ranks.
Formula: r_s = 1 − 6 Σ_i D_i² / (m(m² − 1)), where m is the number of data points and D_i = rank(x_i) − rank(y_i).
Advantage for feature selection: invariant to any monotonic transformation of the original features, and hence able to detect any monotonic relationship with the class.
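An illustrative feature-ranking loop using scipy's spearmanr; the toy data are made up, and the paper obtains its p-values from a permutation test rather than the analytic one scipy reports.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
m, n = 100, 5
X = rng.normal(size=(m, n))
# Class depends monotonically (but nonlinearly) on feature 0 only.
y = (np.exp(X[:, 0]) + 0.1 * rng.normal(size=m) > 1.0).astype(int)

for j in range(n):
    rho, _ = spearmanr(X[:, j], y)       # rank correlation with the class
    print(f"feature {j}: Spearman rho = {rho:+.3f}")
```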

Runs test
Purpose: the runs test is used to decide whether a data sequence comes from a random process.
Definition: a run is a maximal stretch of consecutive values on the same side of the mean; R is the number of runs.
Steps:
1. Compute the mean of the sample.
2. Going through the sample sequence, replace each observation with + or − depending on whether it is above or below the mean.
3. Compute R.
4. Construct a hypothesis test on R.
Advantage: can detect non-monotonic relationships.
For details see Gibbons, J. D., Nonparametric Statistical Inference, Chapter 3.
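A sketch of the above-/below-the-mean runs test described in the steps, using the standard normal approximation for the null distribution of R; this is not necessarily the exact variant or p-value computation used in the paper, and the data are synthetic.

```python
import numpy as np
from scipy.stats import norm

def runs_test(x):
    """Runs test above/below the mean: returns (number of runs, two-sided
    p-value from the normal approximation). A small p-value suggests the
    sequence is not arranged randomly."""
    signs = x > np.mean(x)                        # step 2: + / - vs. the mean
    n1, n2 = int(signs.sum()), int((~signs).sum())
    runs = 1 + int(np.sum(signs[1:] != signs[:-1]))  # step 3: count runs
    # Normal approximation to the null distribution of R.
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu) / np.sqrt(var)
    return runs, 2 * norm.sf(abs(z))

rng = np.random.default_rng(4)
print(runs_test(rng.normal(size=50)))
```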

Permutation Test
Purpose: calculate the p-value for each feature.
Principle: if there is no real association, the class labels might as well be matched up with the feature values in a completely different order.
Formula:
  p = 2 min( (1/m!) Σ_π I(r_{x,y_π} ≥ r_{x,y}),  (1/m!) Σ_π I(r_{x,y_π} ≤ r_{x,y}) )
where the sums are over all m! possible permutations π of y_1, y_2, ..., y_m, with y_π denoting the class labels as permuted by π.
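A sketch of this p-value. Summing over all m! permutations is infeasible for realistic m, so this version samples random permutations, a common approximation rather than the exact computation written above; the data are synthetic.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=9999, rng=None):
    """Two-sided permutation p-value for the correlation between feature x
    and class labels y, using random permutations of y."""
    rng = rng or np.random.default_rng(0)
    r_obs = np.corrcoef(x, y)[0, 1]
    r_perm = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                       for _ in range(n_perm)])
    # Add-one correction so the estimated p-value is never exactly zero.
    p_ge = (1 + np.sum(r_perm >= r_obs)) / (n_perm + 1)
    p_le = (1 + np.sum(r_perm <= r_obs)) / (n_perm + 1)
    return min(1.0, 2 * min(p_ge, p_le))

rng = np.random.default_rng(5)
x = rng.normal(size=60)
y = (x + rng.normal(scale=2.0, size=60) > 0).astype(float)
print(permutation_pvalue(x, y, rng=rng))
```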

Dimensionality reduction with PCA
Benefits:
1. PCA is an unsupervised approach, so it can use all training, validation and test examples.
2. A Bayesian model with a spherically symmetric prior is invariant to the PCA rotation.
3. PCA is feasible even when n is huge, provided m is not too large: the time required is of order min(mn², nm²).
Practice: a power transformation was chosen for each feature to increase its correlation with the class. Whether to use these transformations, and the other choices below, were made manually based on validation-set results.
Some issues raised in the paper:
1. Should features be transformed before PCA? E.g., take square roots.
2. Should features be centered? Zero may be informative.
3. Should features be scaled to have the same variance? The original scale may carry information about relevance.
4. Should principal components be standardized before use? Maybe not.
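A sketch of the min(mn², nm²) point: when n ≫ m, the principal components can be obtained from the m×m Gram matrix instead of the n×n covariance matrix. The centering choice here is arbitrary, precisely one of the issues the slide lists.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 100, 20000          # few cases, very many features (n >> m)
X = rng.normal(size=(m, n))
Xc = X - X.mean(axis=0)    # optional centering (zero may be informative)

# Work with the m x m Gram matrix rather than the n x n covariance matrix:
# cost ~ O(n m^2) instead of O(m n^2).
K = Xc @ Xc.T
eigval, eigvec = np.linalg.eigh(K)              # ascending order
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# Principal-component scores of the training cases.
scores = eigvec * np.sqrt(np.clip(eigval, 0, None))
# Component directions in feature space (needed to project new cases).
directions = Xc.T @ eigvec / np.sqrt(np.clip(eigval, 1e-12, None))
print(scores.shape, directions.shape)           # (100, 100) (20000, 100)
```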

Two-Layer Neural Networks (1/2)
Multilayer perceptron networks with two hidden layers and tanh activation functions:
  P(Y_i = 1 | X_i = x_i) = [1 + exp(−f(x_i))]^{−1}
  f(x_i) = c + Σ_{l=1}^H w_l h_l(x_i)
  h_l(x_i) = tanh( b_l + Σ_{k=1}^G v_{kl} g_k(x_i) )
  g_k(x_i) = tanh( a_k + Σ_{j=1}^n u_{jk} x_ij )

Two-Layer Neural Networks (2/2)
[Network diagram: n input features x_ij feed G hidden units g_k(x_i) through weights u_jk, which feed H hidden units h_l(x_i) through weights v_kl, which feed the output f(x_i) through weights w_l.]
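A minimal numpy forward pass matching the equations on the previous slide; the weights here are random placeholders for shape-checking, not trained or posterior-sampled parameters.

```python
import numpy as np

def forward(x, params):
    """Two-hidden-layer network from the slide:
    g_k = tanh(a_k + sum_j u_jk x_j), h_l = tanh(b_l + sum_k v_kl g_k),
    f = c + sum_l w_l h_l, P(Y=1|x) = 1 / (1 + exp(-f))."""
    a, U, b, V, c, w = params
    g = np.tanh(a + x @ U)     # G first-layer hidden units
    h = np.tanh(b + g @ V)     # H second-layer hidden units
    f = c + h @ w
    return 1.0 / (1.0 + np.exp(-f))

rng = np.random.default_rng(7)
n, G, H = 40, 20, 8
params = (rng.normal(size=G), rng.normal(size=(n, G)),
          rng.normal(size=H), rng.normal(size=(G, H)),
          rng.normal(), rng.normal(size=H))
print(forward(rng.normal(size=n), params))
```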

Conventional neural network learning
Learning can be viewed as maximum likelihood estimation for the network parameters ω.
The values of the weights are computed by maximizing ∏_{i=1}^m P(y_i | x_i, ω), where (x_i, y_i) is training case i.
Predictions for a test case x are made using the conditional distribution P(y | x, ω̂) at the maximizing weights ω̂.
Overfitting happens when the number of network parameters is larger than the number of training cases.

Bayesian Neural Network Learning
Bayesian predictions are found by integration rather than maximization. For a test case x, y is predicted using
  P(y | x, (x_1, y_1), ..., (x_m, y_m)) = ∫ dω P(y | x, ω) P(ω | (x_1, y_1), ..., (x_m, y_m))
The posterior distribution used above is
  P(ω | (x_1, y_1), ..., (x_m, y_m)) ∝ P(ω) ∏_{i=1}^m P(y_i | x_i, ω)
We need to define a prior distribution P(ω).

ARD Prior
What for? By using a hierarchical prior, we can automatically determine how relevant each input is to predicting the class.
How? Each feature is associated with a hyper-parameter that expresses how relevant that feature is. Conditional on these hyper-parameters, the input weights have a multivariate Gaussian distribution with zero mean and a diagonal covariance matrix whose variances are the hyper-parameters, which are themselves given a higher-level prior.
Result: if an input feature is irrelevant, its relevance hyper-parameter will tend to be small, forcing the weights from that input to be near zero.
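A toy sketch of the ARD idea: each input gets its own variance hyper-parameter, and the weights leaving that input are drawn with that variance. The Gamma hyperprior and its parameters here are illustrative assumptions only, not the prior actually used in the paper.

```python
import numpy as np

rng = np.random.default_rng(8)
n, G = 30, 20                      # n input features, G first-layer hidden units

# One relevance hyper-parameter (standard deviation) per input feature,
# drawn here from an arbitrary hierarchical prior for illustration.
relevance_sd = 1.0 / np.sqrt(rng.gamma(shape=0.5, scale=2.0, size=n))

# Conditional on the hyper-parameters, the weights out of each input are
# zero-mean Gaussian with that input's variance.
U = rng.normal(size=(n, G)) * relevance_sd[:, None]

# Inputs with a small relevance_sd get weights near zero, effectively
# switching that feature off.
print(np.round(relevance_sd[:5], 3))
print(np.round(np.abs(U).mean(axis=1)[:5], 3))
```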

Dirichlet Diffusion Trees: Procedure (1/2)
A Dirichlet diffusion tree model defines a procedure for generating data sets one point at a time in an exchangeable way.
Procedure:
- The first point is generated by a simple Gaussian diffusion process from time t = 0 to t = 1.
- The second point initially follows the path of the first one.
- The second point diverges from that path at a random time t.
- After the divergence, the second point follows a Gaussian diffusion process independent of the first one.

Dirichlet Diffusion Trees: Procedure (2/2)
Procedure (continued):
- The nth point initially follows the paths of those before it.
- The nth point diverges at a random time t.
- At a branch point, the nth point selects an old path with probability proportional to the number of points that previously went each way.
- After the divergence, it follows a Gaussian diffusion process independently.

Divergence Function
If a point following a path previously taken by n points has not diverged by time t, the probability that it diverges during the interval (t, t + dt) is a(t) dt / n.
The more points that have already passed along a path, the smaller the probability that a new point diverges from it.

Selection of the divergence function
Select a(t) such that ∫_0^1 a(t) dt = ∞, to keep the distribution continuous (every point eventually diverges before t = 1).
Two possibilities:
  a(t) = c / (1 − t)   or   a(t) = b + c / (1 − t)²
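A sketch of sampling a divergence time for a(t) = c/(1 − t): with n_prev earlier points on the path, the hazard is a(t)/n_prev (previous slide), so the survival function is (1 − t)^(c/n_prev) and inverse-CDF sampling applies. This derivation is assembled from the slides, not taken from the paper's code.

```python
import numpy as np

def sample_divergence_time(n_prev, c, rng):
    """Divergence time of a new point following a path taken by n_prev earlier
    points, for a(t) = c / (1 - t).  The cumulative hazard is
    A(t)/n_prev = -(c/n_prev) log(1 - t), so the survival function is
    (1 - t)^(c/n_prev) and inverse-CDF sampling gives the line below."""
    u = rng.uniform()
    return 1.0 - u ** (n_prev / c)

rng = np.random.default_rng(9)
for n_prev in (1, 2, 10):
    times = [sample_divergence_time(n_prev, c=1.0, rng=rng) for _ in range(5000)]
    # Divergence happens later, on average, the more points share the path.
    print(n_prev, round(float(np.mean(times)), 3))
```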

Dirichlet Diffusion Trees: Model
Structure:
- The structure of the tree
- The divergence times for non-terminal nodes
Hyper-parameters:
- Parameters of the divergence function, for example c in a(t) = c/(1 − t)
- Diffusion standard deviations for each variable
- Noise standard deviations for each variable

Bayesian learning of Dirichlet diffusion trees
Likelihood: the probability of obtaining a given tree and data set can be written as a product of two factors.
- The tree factor is the probability of obtaining the given tree structure and divergence times.
- The data factor is the probability of obtaining the given locations for the divergence points and final data points, given the tree structure and divergence times.
Prior: using ARD again.
- By using a hierarchical prior, we can automatically determine how relevant each input is to predicting the class.

Classification from the tree (1/2)
Classify using the structure of the tree:
- Use the nearest neighbour in the tree
- Use the nearest neighbour with branch weights
Take the average over trees drawn from the posterior.

Classification from the tree (2/2)
Classify using the divergence time as a measure of distance:
- Take the average of the divergence times
- Use the nearest neighbour method
- Use (1 − divergence time) as a measure of dissimilarity
- Use a weighted average based on divergence times:
  P(y_i = 1) = Σ_{j≠i, y_j=1} (e^{r d_ij} − 1) / Σ_{j≠i} (e^{r d_ij} − 1)
  where d_ij is the divergence time between cases i and j.
Note: r can be selected by cross-validation.
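A toy implementation of the weighted-average rule as reconstructed above; the pairwise divergence-time matrix is made up, and in practice r would be chosen by cross-validation.

```python
import numpy as np

def weighted_average_predict(d, y, i, r):
    """P(y_i = 1) as a weighted average of the other cases' labels, with case j
    weighted by exp(r * d_ij) - 1, where d_ij is the divergence time between
    cases i and j (a larger divergence time means the cases are more similar)."""
    w = np.exp(r * d[i]) - 1.0
    w[i] = 0.0                     # exclude case i itself from the average
    return float(np.sum(w * (y == 1)) / np.sum(w))

# Toy pairwise divergence-time matrix for 4 cases and their class labels.
d = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
y = np.array([1, 1, 0, 0])
print(weighted_average_predict(d, y, i=0, r=5.0))   # close to 1
```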

NIPS 2003 Results
[Bar chart of balanced error rate (BER), roughly 0 to 0.14, comparing the best challenge entry ("Best") with the BNN-DDT combination ("BNN-DDT-combo"), overall and on each data set: Arcene, Gisette, Dexter, Dorothea, Madelon.]

Conclusion
- It is unclear whether it is the Bayesian model itself that contributes most to the learning performance.
- A spherically symmetric prior is usually unrealistic for real applications, so I doubt whether the Bayesian model is really invariant to PCA in practice.
- I also doubt whether we can use the Dirichlet diffusion tree model without knowing whether there is a hidden tree structure behind the data.