Necessary and Sufficient Conditions for High-Dimensional Salient Feature Subset Recovery

Transcription:

Necessary and Sufficient Conditions for High-Dimensional Salient Feature Subset Recovery. Vincent Tan, Matt Johnson, Alan S. Willsky. Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology. ISIT (Jun 14, 2010).

Motivation: A Real-Life Example. The Manchester Asthma and Allergy Study (MAAS) is a birth-cohort study of more than $n \approx 1000$ children (www.maas.org.uk). The number of variables is $d \approx 10^6$, but only $k \approx 30$ are salient for assessing susceptibility to asthma. Identification of these salient features is important.

Some Natural Questions. What is the meaning of saliency? How can it be defined information-theoretically? How many samples $n$ are required to identify the $k$ salient features out of the $d$ features? What are the fundamental limits for recovery of the salient set? Are there any efficient algorithms for special classes of features?

Main Contributions. We define the salient set by appealing to the error exponents in hypothesis testing. Sufficiency: We show that if, for all $n$ sufficiently large, $n > \max\bigl\{ C_1\, k \log\tfrac{d-k}{k},\ \exp(C_2 k) \bigr\}$, then the error probability is arbitrarily small. Necessity: Under certain conditions, for $\lambda \in (0, 1)$, if $n < \lambda\, C_3 \log\tfrac{d}{k}$, then the error probability is at least $1 - \lambda$.

System Model. Let the alphabet for each variable be $\mathcal{X}$, a finite set. We are in the high-dimensional setting where $d$ and $k$ grow with $n$. There are two sequences of unknown $d$-dimensional distributions $P^{(d)}, Q^{(d)} \in \mathcal{P}(\mathcal{X}^d)$, $d \in \mathbb{N}$. We observe i.i.d. samples $(\mathbf{x}^n, \mathbf{y}^n) := (\{x^{(l)}\}_{l=1}^n, \{y^{(l)}\}_{l=1}^n)$ drawn from $P^{(d)} \times Q^{(d)}$; each pair of samples $x^{(l)}, y^{(l)} \in \mathcal{X}^d$.

Definition of Saliency. Motivated by the asymptotics of binary hypothesis testing: $H_0: \mathbf{z}^n \sim P^{(d)}$ versus $H_1: \mathbf{z}^n \sim Q^{(d)}$. Chernoff-Stein Lemma: If $\Pr(\hat{H}_1 \mid H_0) < \alpha$, then $\Pr(\hat{H}_0 \mid H_1) \doteq \exp\bigl(-n\, D(P^{(d)} \,\|\, Q^{(d)})\bigr)$. Chernoff Information: $\Pr(\mathrm{err}) = \pi_0 \Pr(\hat{H}_1 \mid H_0) + \pi_1 \Pr(\hat{H}_0 \mid H_1) \doteq \exp\bigl(-n\, C(P^{(d)}, Q^{(d)})\bigr)$, where $C(P, Q) := -\min_{t \in [0,1]} \log \sum_{z} P(z)^t Q(z)^{1-t}$.
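
To make these two exponents concrete, here is a small numerical sketch (mine, not from the talk) that evaluates the KL divergence and the Chernoff information for a pair of toy distributions; the vectors `P` and `Q` are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_divergence(P, Q):
    """D(P || Q) in nats for discrete distributions given as arrays."""
    return float(np.sum(P * np.log(P / Q)))

def chernoff_information(P, Q):
    """C(P, Q) = -min_{t in [0,1]} log sum_z P(z)^t Q(z)^(1-t)."""
    f = lambda t: np.log(np.sum(P**t * Q**(1 - t)))
    res = minimize_scalar(f, bounds=(0.0, 1.0), method="bounded")
    return -float(res.fun)

# Toy distributions on a 3-letter alphabet (illustrative values only).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

print("D(P||Q) =", kl_divergence(P, Q))         # exponent of the missed-detection probability
print("C(P,Q)  =", chernoff_information(P, Q))  # exponent of the Bayesian error probability
```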

Definition of Saliency. $P^{(d)}_A$ is the marginal of $P^{(d)}$ on the subset $A \subseteq \{1, \ldots, d\}$. Definition: A set $S_d \subseteq \{1, \ldots, d\}$ is KL-divergence-salient if $D(P^{(d)} \,\|\, Q^{(d)}) = D(P^{(d)}_{S_d} \,\|\, Q^{(d)}_{S_d})$. Definition: A set $S_d \subseteq \{1, \ldots, d\}$ is Chernoff-information-salient if $C(P^{(d)}, Q^{(d)}) = C(P^{(d)}_{S_d}, Q^{(d)}_{S_d})$.

Characterization of Saliency. Lemma: The following are equivalent: (i) $S_d$ is KL-divergence-salient; (ii) $S_d$ is Chernoff-information-salient; (iii) the conditionals are identical, i.e., $P^{(d)} = P^{(d)}_{S_d}\, W_{S_d^c \mid S_d}$ and $Q^{(d)} = Q^{(d)}_{S_d}\, W_{S_d^c \mid S_d}$ for a common conditional $W_{S_d^c \mid S_d}$. What are the scaling laws on $(n, d, k)$ so that the error probability can be made arbitrarily small?
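
A quick numerical check of the lemma (my own sketch with arbitrary toy values, not code from the paper): build $P^{(d)} = P^{(d)}_{S_d} W_{S_d^c \mid S_d}$ and $Q^{(d)} = Q^{(d)}_{S_d} W_{S_d^c \mid S_d}$ with a common conditional $W$, and verify that neither the KL divergence nor the Chernoff information changes when the distributions are marginalized onto $S_d$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl(P, Q):
    """D(P || Q) for distributions given as same-shaped arrays."""
    return float(np.sum(P * np.log(P / Q)))

def chernoff(P, Q):
    """C(P, Q) = -min_{t in [0,1]} log sum P^t Q^(1-t)."""
    f = lambda t: np.log(np.sum(P**t * Q**(1 - t)))
    return -float(minimize_scalar(f, bounds=(0.0, 1.0), method="bounded").fun)

# Salient marginals differ; the conditional W(x2 | x1) is shared (toy values).
P_S, Q_S = np.array([0.7, 0.3]), np.array([0.4, 0.6])
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Two-dimensional joints P^{(d)} = P_S * W and Q^{(d)} = Q_S * W.
P_full, Q_full = P_S[:, None] * W, Q_S[:, None] * W

print(kl(P_full, Q_full), kl(P_S, Q_S))              # equal: KL-divergence-salient
print(chernoff(P_full, Q_full), chernoff(P_S, Q_S))  # equal: Chernoff-information-salient
```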

Definition of Achievability. A decoder is a set-valued function that maps the samples to subsets of size $k$, i.e., $\psi_n : (\mathcal{X}^d)^n \times (\mathcal{X}^d)^n \to \binom{\{1, \ldots, d\}}{k}$. The decoder is given the true value of $k$, the number of salient features; we are working on relaxing this assumption. Definition: The tuple of model parameters $(n, d, k)$ is achievable for $\{P^{(d)}, Q^{(d)}\}_{d \in \mathbb{N}}$ if there exists a sequence of decoders $\{\psi_n\}$ such that $q_n(\psi_n) := \Pr\bigl(\psi_n(\mathbf{x}^n, \mathbf{y}^n) \neq S_d\bigr) < \epsilon$ for all $n > N_\epsilon$.

Three Assumptions on the Distributions $P^{(d)}, Q^{(d)}$. Saliency: For every $P^{(d)}, Q^{(d)}$ there exists a salient set $S_d$ of known size $k$, i.e., $D(P^{(d)} \,\|\, Q^{(d)}) = D(P^{(d)}_{S_d} \,\|\, Q^{(d)}_{S_d})$. $\eta$-Distinguishability: There exists a constant $\eta > 0$ such that $D(P^{(d)}_{S_d} \,\|\, Q^{(d)}_{S_d}) - D(P^{(d)}_{T_d} \,\|\, Q^{(d)}_{T_d}) \geq \eta$ for all $T_d \neq S_d$ such that $|T_d| = k$. $L$-Boundedness: There exists a constant $L \in (0, \infty)$ such that $\log\bigl[ P^{(d)}_{S_d}(z_{S_d}) \big/ Q^{(d)}_{S_d}(z_{S_d}) \bigr] \in [-L, L]$ for all states $z_{S_d} \in \mathcal{X}^k$.

Achievability Result. Theorem: If there exists a $\delta > 0$ such that, for some $B > 0$, $n > \max\Bigl\{ B\, k \log\tfrac{d-k}{k},\ \exp\bigl(\tfrac{2k \log |\mathcal{X}|}{1 - \delta}\bigr) \Bigr\}$, then there exists a sequence of decoders $\psi_n$ that satisfies $q_n(\psi_n) = O(\exp(-nE))$ for some exponent $E > 0$.
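
To get a feel for the two terms in this sufficiency condition, here is a back-of-the-envelope computation (mine; $B$, $\delta$, and the alphabet size are hypothetical, since the talk gives no numerical values), using $d = 10^6$, the order of magnitude of the MAAS example.

```python
import math

# Hypothetical constants for illustration only; B, delta and |X| are not pinned down in the talk.
B, delta, alphabet_size = 1.0, 0.5, 2
d = 10**6  # order of magnitude of the MAAS example

for k in (5, 10, 30):
    term_log = B * k * math.log((d - k) / k)                            # B k log((d-k)/k)
    term_exp = math.exp(2 * k * math.log(alphabet_size) / (1 - delta))  # exp(2k log|X| / (1-delta))
    print(f"k={k:2d}: k-log term = {term_log:9.1f}, exp term = {term_exp:.2e}")

# The exponential term grows very quickly in k, which is why the condition
# effectively restricts k to grow no faster than O(log n).
```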

Proof Idea and a Corollary. Use the exhaustive-search decoder: search for the size-$k$ set with the largest empirical KL divergence (a sketch of this decoder is given after this slide). Apply large-deviation bounds, e.g., Sanov's theorem, to establish positivity of the error exponent under the specified conditions. This is similar to the main result of [Ng, UAI 1998], but we focus on exact subset recovery rather than generalization error. Corollary: Let $k = k_0$ be a constant and $R \in (0, B/k_0)$. Then if $n > \tfrac{\log d}{R}$, we have $q_n(\psi_n) = O(\exp(-nE))$.
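
Below is a minimal sketch of the exhaustive-search decoder described above (my own code on synthetic data; the dimensions, Bernoulli parameters, and the add-one smoothing used to avoid empty empirical cells are illustrative choices, not from the paper).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

# Synthetic setting (illustrative only): binary features, d = 6, true salient set {0, 1}.
d, k, n = 6, 2, 5000
S_true = (0, 1)

def sample(n, p_salient):
    """Draw n i.i.d. feature vectors: salient coordinates are Bernoulli(p_salient[i]),
    non-salient coordinates are Bernoulli(0.5), independent of everything else."""
    X = (rng.random((n, d)) < 0.5).astype(int)
    for i, p in zip(S_true, p_salient):
        X[:, i] = (rng.random(n) < p).astype(int)
    return X

x = sample(n, p_salient=[0.8, 0.3])   # samples from P^{(d)}
y = sample(n, p_salient=[0.4, 0.6])   # samples from Q^{(d)}

def empirical(Z, subset):
    """Empirical distribution (with add-one smoothing) of the columns in `subset`."""
    counts = np.ones(2 ** len(subset))                       # smoothing avoids log(0)
    idx = Z[:, list(subset)] @ (2 ** np.arange(len(subset)))
    np.add.at(counts, idx, 1)
    return counts / counts.sum()

def empirical_kl(subset):
    P_hat, Q_hat = empirical(x, subset), empirical(y, subset)
    return float(np.sum(P_hat * np.log(P_hat / Q_hat)))

# Exhaustive search over all size-k subsets for the largest empirical KL divergence.
S_hat = max(combinations(range(d), k), key=empirical_kl)
print("true salient set:", S_true, " estimate:", S_hat)
```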

Converse Result. We assume that the salient set $S_d$ is chosen uniformly at random over all subsets of size $k$. Theorem: If, for some $\lambda \in (0, 1)$, $n < \dfrac{\lambda\, k \log(d/k)}{H(P^{(d)}) + H(Q^{(d)})}$, then $q_n(\psi_n) \geq 1 - \lambda$ for all decoders $\psi_n$.

Remarks. The proof is a consequence of Fano's inequality. The bound is never satisfied if the variables in $S_d^c$ are uniform and independent of those in $S_d$: in that case $H(P^{(d)}) + H(Q^{(d)}) = O(d)$, so the condition $n < \frac{\lambda\, k \log(d/k)}{H(P^{(d)}) + H(Q^{(d)})}$ cannot hold. However, the converse is interesting if the distributions have additional structure on their entropies. Assume most of the features are processed or redundant. Example: there could be two features, BMI and an is-obese indicator, where one is a processed version of the other.
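
The first remark says the proof follows from Fano's inequality; here is a rough sketch (my own reconstruction under the stated uniform prior on $S_d$, not the paper's proof) of how a bound of this form arises.

```latex
% Rough sketch; vanishing terms are handled loosely.
% S_d is uniform over the \binom{d}{k} candidate subsets, so H(S_d) = \log\binom{d}{k}.
\begin{align*}
(1 - q_n)\log\binom{d}{k} - 1
  &\le I\bigl(S_d;\, \mathbf{x}^n, \mathbf{y}^n\bigr)
      && \text{(Fano's inequality)} \\
  &\le n\, I\bigl(S_d;\, x^{(1)}, y^{(1)}\bigr)
      && \text{(samples i.i.d.\ given } S_d\text{)} \\
  &\le n\bigl(H(P^{(d)}) + H(Q^{(d)})\bigr)
      && \text{(mutual information $\le$ entropy).}
\end{align*}
% Rearranging and using \log\binom{d}{k} \ge k\log(d/k):
%   q_n \ge 1 - \frac{n\bigl(H(P^{(d)}) + H(Q^{(d)})\bigr) + 1}{k\log(d/k)},
% so n < \lambda k\log(d/k) / (H(P^{(d)}) + H(Q^{(d)})) forces q_n \ge 1 - \lambda,
% up to an asymptotically negligible term.
```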

Converse Result. Corollary: Assume that there exists an $M < \infty$ such that the conditional entropies satisfy $\max\bigl\{ H\bigl(P^{(d)}_{S_d^c \mid S_d}\bigr),\ H\bigl(Q^{(d)}_{S_d^c \mid S_d}\bigr) \bigr\} \leq M k$. If the number of samples satisfies $n < \frac{\lambda}{2(M + \log|\mathcal{X}|)} \log\tfrac{d}{k}$, then $q_n(\psi_n) \geq 1 - \lambda$.

Comparison to Achievability Result. Assume $d = \exp(nR)$ and $k$ is constant. There is a rate $R_{\mathrm{ach}}$ so that if $R < R_{\mathrm{ach}}$, then $(n, d, k)$ is achievable. Conversely, there is another rate $R_{\mathrm{conv}}$ so that if $R > R_{\mathrm{conv}}$, then recovery is not possible. [Figure: a number line along the rate axis $R$, marking $0$, $R_{\mathrm{ach}}$, and $R_{\mathrm{conv}}$ in that order.]
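
One way to see where $R_{\mathrm{ach}}$ and $R_{\mathrm{conv}}$ come from is to specialize the two corollaries above to $d = \exp(nR)$ with constant $k = k_0$: the achievability corollary gives recovery for $R < B/k_0$, while the converse corollary gives error probability at least $1 - \lambda$ once $R$ exceeds roughly $2(M + \log|\mathcal{X}|)/\lambda$. The numbers below are entirely hypothetical placeholders, since the talk fixes none of these constants.

```python
import math

# Hypothetical constants for illustration only (none are given numerically in the talk).
B, k0 = 1.0, 5                        # achievability corollary: recovery possible for R < B / k0
M, lam, alphabet_size = 2.0, 0.5, 2   # converse corollary parameters

R_ach = B / k0
R_conv = 2 * (M + math.log(alphabet_size)) / lam  # beyond this rate, error probability >= 1 - lam

print(f"R_ach  ~ {R_ach:.3f} nats: d = exp(nR) is recoverable for rates R below this")
print(f"R_conv ~ {R_conv:.3f} nats: error probability >= {1 - lam} for rates R above this")
```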

Conclusions. Provided an information-theoretic definition of saliency motivated by error exponents in hypothesis testing. Provided necessary and sufficient conditions for salient-set recovery; the number of samples $n$ can be much smaller than $d$, the total number of variables. In the paper, we provide a computationally efficient and consistent algorithm to search for $S_d$ when $P^{(d)}$ and $Q^{(d)}$ are Markov on trees.