Information Theory in Computer Vision and Pattern Recognition


Francisco Escolano · Pablo Suau · Boyán Bonev

Information Theory in Computer Vision and Pattern Recognition

Foreword by Alan Yuille

Springer

Francisco Escolano
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante, Spain
sco@dccia.ua.es

Boyán Bonev
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante, Spain
boyan@dccia.ua.es

Pablo Suau
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante, Spain
pablo@dccia.ua.es

ISBN 978-1-84882-296-2
e-ISBN 978-1-84882-297-9
DOI 10.1007/978-1-84882-297-9
Springer Dordrecht Heidelberg London New York

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2009927707

© Springer-Verlag London Limited 2009

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Cover design: SPi Publisher Services
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

To my three joys: Irene, Ana, and Mamen.
Francisco

To my parents, grandparents, and brother. To Beatriz.
Pablo

To Niya, Elina, and Maria.
Boyan

Foreword

Computer vision and pattern recognition are extremely important research fields with an enormous range of applications. They are also extremely difficult. This may seem paradoxical, since humans can easily interpret images and detect spatial patterns. But this apparent ease is misleading, because neuroscience shows that humans devote a large part of their brain, possibly up to 50% of the cortex, to processing images and interpreting them. The difficulties of these problems have been appreciated over the last 30 years as researchers have struggled to develop computer algorithms for performing vision and pattern recognition tasks. Although these problems are not yet completely solved, it is becoming clear that the final theory will depend heavily on probabilistic techniques and the use of concepts from information theory.

The connections between information theory and computer vision have long been appreciated. Vision can be considered a decoding problem, where the encoding of the information is performed by the physics of the world: light rays strike objects and are reflected to cameras or eyes. Ideal observer theories were pioneered by scientists such as Horace Barlow to compute the amount of information available in the visual stimuli, and to see how efficient humans are at exploiting it. But despite the application of information theory to specific visual tasks, there has been no attempt to bring all this work together into a clear conceptual framework.

This book fills the gap by describing how probability and information theory can be used to address computer vision and pattern recognition problems. The authors have developed information theory tools side by side with vision and pattern recognition tasks. They have organized these tools into four classes: (i) measures, (ii) principles, (iii) theories, and (iv) algorithms. The book is organized into chapters addressing computer vision and pattern recognition tasks at increasing levels of complexity. The authors have devoted chapters to feature detection and spatial grouping, image segmentation, matching, clustering, feature selection, and classifier design.

As the authors address these topics, they gradually introduce techniques from information theory. These include (1) information theoretic measures, such as entropy and Chernoff information, to evaluate image features; (2) mutual information as a criterion for matching problems (Viola and Wells 1997); (3) minimum description length ideas (Rissanen 1978) and their application to image segmentation (Zhu and Yuille 1996); (4) independent component analysis (Bell and Sejnowski 1995) and its use for feature extraction; (5) the use of rate distortion theory for clustering algorithms; (6) the method of types (Cover and Thomas 1991) and its application to analyze the convergence rates of vision algorithms (Coughlan and Yuille 2002); and (7) how entropy and infomax principles (Linsker 1988) can be used for classifier design. In addition, the book covers alternative information theory measures, such as the Rényi alpha-entropy and the Jensen–Shannon divergence, and advanced topics, such as data-driven Markov chain Monte Carlo (Tu and Zhu 2002) and information geometry (Amari 1985). The book describes these theories clearly, giving many illustrations and specifying the code by flow charts.

Overall, the book is a very worthwhile addition to the computer vision and pattern recognition literature. The authors have given an advanced introduction to techniques from probability and information theory and their application to vision and pattern recognition tasks. More importantly, they have described a novel perspective that will be of growing importance over time. As computer vision and pattern recognition develop, the details of these theories will change, but the underlying concepts will remain the same.

UCLA, Department of Statistics and Psychology
Los Angeles, CA
March 2009
Alan Yuille

Preface

Looking through the glasses of Information Theory (IT) has proved to be effective both for formulating and for designing algorithmic solutions to many problems in computer vision and pattern recognition (CVPR): image matching, clustering and segmentation, salient point detection, feature selection and dimensionality reduction, projection pursuit, optimal classifier design, and many others. Nowadays, researchers are widely bringing IT elements to the CVPR arena. Among these elements there are measures (entropy, mutual information, Kullback–Leibler divergence, Jensen–Shannon divergence...), principles (maximum entropy, minimax entropy, minimum description length...) and theories (rate distortion theory, coding, the method of types...). This book introduces and explores these elements, together with that of entropy estimation, through an incremental complexity approach. Simultaneously, the main CVPR problems are formulated, and the most representative algorithms, chosen according to the authors' preferences for sketching the IT-CVPR field, are presented. Interesting connections between IT elements applied to different problems are highlighted, seeking a basic, skeletal research roadmap. This roadmap is far from being comprehensive at present, due to time and space constraints, and also due to the current state of development of the approach. The result is a novel tool, unique in its conception, both for CVPR and IT researchers, which is intended to contribute, as much as possible, to a cross-fertilization of both areas.

The motivation and origin of this manuscript is our awareness of the existence of many sparse sources of IT-based solutions to CVPR problems, and the lack of a systematic text that focuses on the important question: How useful is IT for CVPR? At the same time, we needed a research language common to all the members of the Robot Vision Group. Energy minimization, graph theory, and Bayesian inference, among others, were adequate methodological tools during our daily research. Consequently, these tools were key to designing and building a solid background for our Ph.D. students. Soon we realized that IT was a unifying component that flowed naturally through our rationales for tackling CVPR problems. Thus, some of us enrolled in the task of writing a text in which we could advance as much as possible in the fundamental links between CVPR and IT. Readers (newcomers and senior researchers) will judge to what extent we have both answered the above fundamental question and reached our objectives.

Although the text is addressed to CVPR and IT researchers and students, it is also open to an interdisciplinary audience. One of the most interesting examples is the computational vision community, which includes people interested both in biological vision and in psychophysics. Other examples are roboticists and people interested in developing wearable solutions for the visually impaired (which is the subject of our active work in the research group).

Under its basic conception, this text may be used for a one-semester, IT-based course on CVPR. Only some rudiments of algebra and probability are necessary. IT items are introduced as the text flows from one computer vision or pattern recognition problem to another. We have deliberately avoided a succession of theorem-proof pairs for the sake of a smooth presentation. Proofs, when needed, are embedded in the text, and they are usually excellent pretexts for presenting or highlighting interesting properties of IT elements. Numerical examples with toy settings of the problems are often included for a better understanding of the IT-based solution. When formal elements of other branches of mathematics, like field theory, optimization, and so on, are needed, we briefly present them and refer to excellent books fully dedicated to their description. Problems, questions, and exercises are also proposed at the end of each chapter. The purpose of the problems section is not only to consolidate what is learnt, but also to go one step further by testing the reader's ability to generalize the concepts exposed in each chapter. That section is preceded by a brief literature review that outlines the key papers for the CVPR topic that is the subject of the chapter. These references, together with sketched solutions to the problems, will be progressively made accessible on the Web site http://www.rvg.ua.es/itincvpr.

We start the book with a brief introduction (Chapter 1) regarding the four axes of IT-CVPR interaction (measures, principles, theories, and entropy estimators). There we also present the skeletal research roadmap (the ITinCVPR tube). Then we walk along six chapters, each one tackling a different problem under the IT perspective. Chapter 2 is devoted to interest points, edge detection, and grouping; interest points allow us to introduce the concept of entropy and its links with Chernoff information, Sanov's theorem, phase transitions, and the method of types. Chapter 3 covers contour- and region-based image segmentation, mainly from the perspective of model order selection through the minimum description length (MDL) principle, although the Jensen–Shannon measure and the Jaynes principle of maximum entropy are also introduced; the question of learning a segmentation model is tackled through links with maximum entropy and belief propagation; and the unification of generative and discriminative processes for segmentation and recognition is explored through information divergence measures. Chapter 4 reviews registration, matching, and recognition by considering the following: image registration through maximization of mutual information and related measures; alternative derivations of the Jensen–Shannon divergence, which yield deformable matching; shape comparison through Fisher information; and structural matching and learning driven by MDL. Chapter 5 is devoted to image and pattern clustering and is mainly rooted in three IT approaches to clustering: Gaussian mixtures (an incremental method for adequate order selection), the information bottleneck (agglomerative, and robust with model order selection), and mean shift; IT is also present in initial proposals for clustering ensembles (consensus finding). Chapter 6 reviews the main approaches to feature selection and transformation: simple wrappers and filters exploiting IT for bypassing the curse of dimensionality; the minimax entropy principle for learning patterns using a generative approach; and ICA/gPCA methods based on IT (ICA and negentropy, infomax and minimax ICA, generalized PCA and effective dimension). Finally, Chapter 7, Classifier Design, analyzes the main IT strategies for building classifiers. This obviously includes decision trees, but also multiple trees and random forests, and how to improve boosting algorithms by means of IT-based criteria. This final chapter ends with an information projection analysis of maximum entropy classifiers and a careful exploration of the links between Bregman divergences and classifiers.

We acknowledge the contribution of many people to this book. First of all, we thank many scientists for their guidance and support, and for their important contributions to the field. Researchers from different universities and institutions, such as Alan Yuille, Hamid Krim, Chen Ping-Feng, Gozde Unal, Ajit Rajwade, Anand Rangarajan, Edwin Hancock, Richard Nock, Shun-ichi Amari, and Mario Figueiredo, among many others, contributed their advice, deep knowledge, and highly qualified expertise. We also thank all the colleagues of the Robot Vision Group of the University of Alicante, especially Antonio Peñalver, Juan Manuel Sáez, and Miguel Cazorla, who contributed figures, algorithms, and important results from their research. Finally, we thank the editorial staff: Catherine Brett for her initial encouragement and support, and Simon Rees and Wayne Wheeler for their guidance and patience.

University of Alicante, Spain
Francisco Escolano
Pablo Suau
Boyán Bonev
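A note for readers before Chapter 1: as a purely illustrative aside (this sketch is not taken from the book, and its function names, toy distributions, and joint histogram are invented for illustration only), the short Python fragment below shows how compact the measures enumerated above are in practice: Shannon entropy, Kullback–Leibler divergence, Jensen–Shannon divergence, and mutual information estimated from a joint intensity histogram, the quantity maximized in Viola–Wells style image alignment.

import numpy as np

def shannon_entropy(p):
    # H(p) = -sum_i p_i log2 p_i, in bits; 0 log 0 is taken as 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    # D(p || q); assumes q_i > 0 wherever p_i > 0.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def js_divergence(p, q):
    # Symmetric and bounded: JS(p, q) = (D(p||m) + D(q||m)) / 2, m = (p + q) / 2.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def mutual_information(joint_hist):
    # I(X;Y) from a 2-D array of co-occurrence counts of two images' intensities.
    pxy = np.asarray(joint_hist, dtype=float)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of the first image
    py = pxy.sum(axis=0, keepdims=True)   # marginal of the second image
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])
print(shannon_entropy(p))                       # 1.5 bits
print(js_divergence(p, q))                      # finite and symmetric, unlike raw KL
print(mutual_information([[30, 5], [5, 60]]))   # positive: intensities co-vary

Base-2 logarithms are used throughout so that every quantity is expressed in bits; using the natural logarithm would only rescale the values.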

Contents

1 Introduction
   1.1 Measures, Principles, Theories, and More
   1.2 Detailed Organization of the Book
   1.3 The ITinCVPR Roadmap

2 Interest Points, Edges, and Contour Grouping
   2.1 Introduction
   2.2 Entropy and Interest Points
       2.2.1 Kadir and Brady Scale Saliency Detector
       2.2.2 Point Filtering by Entropy Analysis Through Scale Space
       2.2.3 Chernoff Information and Optimal Filtering
       2.2.4 Bayesian Filtering of the Scale Saliency Feature Extractor: The Algorithm
   2.3 Information Theory as Evaluation Tool: The Statistical Edge Detection Case
       2.3.1 Statistical Edge Detection
       2.3.2 Edge Localization
   2.4 Finding Contours Among Clutter
       2.4.1 Problem Statement
       2.4.2 A* Road Tracking
       2.4.3 A Convergence Proof
   2.5 Junction Detection and Grouping
       2.5.1 Junction Detection
       2.5.2 Connecting and Filtering Junctions
   Problems
   2.6 Key References

3 Contour and Region-Based Image Segmentation
   3.1 Introduction
   3.2 Discriminative Segmentation with Jensen–Shannon Divergence
       3.2.1 The Active Polygons Functional
       3.2.2 Jensen–Shannon Divergence and the Speed Function
   3.3 MDL in Contour-Based Segmentation
       3.3.1 B-Spline Parameterization of Contours
       3.3.2 MDL for B-Spline Parameterization
       3.3.3 MDL Contour-Based Segmentation
   3.4 Model Order Selection in Region-Based Segmentation
       3.4.1 Jump-Diffusion for Optimal Segmentation
       3.4.2 Speeding-up the Jump-Diffusion Process
       3.4.3 K-adventurers Algorithm
   3.5 Model-Based Segmentation Exploiting the Maximum Entropy Principle
       3.5.1 Maximum Entropy and Markov Random Fields
       3.5.2 Efficient Learning with Belief Propagation
   3.6 Integrating Segmentation, Detection and Recognition
       3.6.1 Image Parsing
       3.6.2 The Data-Driven Generative Model
       3.6.3 The Power of Discriminative Processes
       3.6.4 The Usefulness of Combining Generative and Discriminative
   Problems
   3.7 Key References

4 Registration, Matching, and Recognition
   4.1 Introduction
   4.2 Image Alignment and Mutual Information
       4.2.1 Alignment and Image Statistics
       4.2.2 Entropy Estimation with Parzen's Windows
       4.2.3 The EMMA Algorithm
       4.2.4 Solving the Histogram-Binning Problem
   4.3 Alternative Metrics for Image Alignment
       4.3.1 Normalizing Mutual Information
       4.3.2 Conditional Entropies
       4.3.3 Extension to the Multimodal Case
       4.3.4 Affine Alignment of Multiple Images
       4.3.5 The Rényi Entropy
       4.3.6 Rényi's Entropy and Entropic Spanning Graphs
       4.3.7 The Jensen–Rényi Divergence and Its Applications
       4.3.8 Other Measures Related to Rényi Entropy
       4.3.9 Experimental Results
   4.4 Deformable Matching with Jensen Divergence and Fisher Information
       4.4.1 The Distributional Shape Model
       4.4.2 Multiple Registration and Jensen–Shannon Divergence
       4.4.3 Information Geometry and Fisher–Rao Information
       4.4.4 Dynamics of the Fisher Information Metric
   4.5 Structural Learning with MDL
       4.5.1 The Usefulness of Shock Trees
       4.5.2 A Generative Tree Model Based on Mixtures
       4.5.3 Learning the Mixture
       4.5.4 Tree Edit-Distance and MDL
   Problems
   4.6 Key References

5 Image and Pattern Clustering
   5.1 Introduction
   5.2 Gaussian Mixtures and Model Selection
       5.2.1 Gaussian Mixtures Methods
       5.2.2 Defining Gaussian Mixtures
       5.2.3 EM Algorithm and Its Drawbacks
       5.2.4 Model Order Selection
   5.3 EBEM Algorithm: Exploiting Entropic Graphs
       5.3.1 The Gaussianity Criterion and Entropy Estimation
       5.3.2 Shannon Entropy from Rényi Entropy Estimation
       5.3.3 Minimum Description Length for EBEM
       5.3.4 Kernel-Splitting Equations
       5.3.5 Experiments
   5.4 Information Bottleneck and Rate Distortion Theory
       5.4.1 Rate Distortion Theory Based Clustering
       5.4.2 The Information Bottleneck Principle
   5.5 Agglomerative IB Clustering
       5.5.1 Jensen–Shannon Divergence and Bayesian Classification Error
       5.5.2 The AIB Algorithm
       5.5.3 Unsupervised Clustering of Images
   5.6 Robust Information Clustering
   5.7 IT-Based Mean Shift
       5.7.1 The Mean Shift Algorithm
       5.7.2 Mean Shift Stop Criterion and Examples
       5.7.3 Rényi Quadratic and Cross Entropy from Parzen Windows
       5.7.4 Mean Shift from an IT Perspective
   5.8 Unsupervised Classification and Clustering Ensembles
       5.8.1 Representation of Multiple Partitions
       5.8.2 Consensus Functions
   Problems
   5.9 Key References

6 Feature Selection and Transformation
   6.1 Introduction
   6.2 Wrapper and the Cross Validation Criterion
       6.2.1 Wrapper for Classifier Evaluation
       6.2.2 Cross Validation
       6.2.3 Image Classification Example
       6.2.4 Experiments
   6.3 Filters Based on Mutual Information
       6.3.1 Criteria for Filter Feature Selection
       6.3.2 Mutual Information for Feature Selection
       6.3.3 Individual Features Evaluation, Dependence and Redundancy
       6.3.4 The Min-Redundancy Max-Relevance Criterion
       6.3.5 The Max-Dependency Criterion
       6.3.6 Limitations of the Greedy Search
       6.3.7 Greedy Backward Search
       6.3.8 Markov Blankets for Feature Selection
       6.3.9 Applications and Experiments
   6.4 Minimax Feature Selection for Generative Models
       6.4.1 Filters and the Maximum Entropy Principle
       6.4.2 Filter Pursuit through Minimax Entropy
   6.5 From PCA to gPCA
       6.5.1 PCA, FastICA, and Infomax
       6.5.2 Minimax Mutual Information ICA
       6.5.3 Generalized PCA (gPCA) and Effective Dimension
   Problems
   6.6 Key References

7 Classifier Design
   7.1 Introduction
   7.2 Model-Based Decision Trees
       7.2.1 Reviewing Information Gain
       7.2.2 The Global Criterion
       7.2.3 Rare Classes with the Greedy Approach
       7.2.4 Rare Classes with Global Optimization
   7.3 Shape Quantization and Multiple Randomized Trees
       7.3.1 Simple Tags and Their Arrangements
       7.3.2 Algorithm for the Simple Tree
       7.3.3 More Complex Tags and Arrangements
       7.3.4 Randomizing and Multiple Trees
   7.4 Random Forests
       7.4.1 The Basic Concept
       7.4.2 The Generalization Error of the RF Ensemble
       7.4.3 Out-of-the-Bag Estimates of the Error Bound
       7.4.4 Variable Selection: Forest-RI vs. Forest-RC
   7.5 Infomax and Jensen–Shannon Boosting
       7.5.1 The Infomax Boosting Algorithm
       7.5.2 Jensen–Shannon Boosting
   7.6 Maximum Entropy Principle for Classification
       7.6.1 Improved Iterative Scaling
       7.6.2 Maximum Entropy and Information Projection
   7.7 Bregman Divergences and Classification
       7.7.1 Concept and Properties
       7.7.2 Bregman Balls and Core Vector Machines
       7.7.3 Unifying Classification: Bregman Divergences and Surrogates
   Problems
   7.8 Key References

References
Index
Color Plates