Necessary and Sufficient Conditions for High-Dimensional Salient Feature Subset Recovery
Vincent Tan, Matt Johnson, Alan S. Willsky
Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
ISIT (Jun 14, 2010)
Motivation: A Real-Life Example
- The Manchester Asthma and Allergy Study (MAAS) is a birth-cohort study of more than $1000$ children ($n \approx 1000$); see www.maas.org.uk.
- Number of variables: $d \approx 10^6$.
- But only $k \approx 30$ are salient for assessing susceptibility to asthma.
- Identification of these salient features is important.
Some Natural Questions
- What is the meaning of saliency? How can it be defined information-theoretically?
- How many samples $n$ are required to identify the $k$ salient features among the $d$ features? What are the fundamental limits for recovery of the salient set?
- Are there any efficient algorithms for special classes of features?
Main Contributions
- We define the salient set by appealing to the error exponents in hypothesis testing.
- Sufficiency: we show that if, for all sufficiently large $n$,
  $$ n > \max\left\{ C_1\, k \log\left(\frac{d-k}{k}\right),\; \exp(c_2 k) \right\}, $$
  then the error probability is arbitrarily small.
- Necessity: under certain conditions, for $\lambda \in (0,1)$, if
  $$ n < \lambda\, C_3 \log\left(\frac{d}{k}\right), $$
  then the error probability is at least $1 - \lambda$.
System Model
- Let the alphabet for each variable be $\mathcal{X}$, a finite set.
- High-dimensional setting where $d$ and $k$ grow with $n$.
- Two sequences of unknown $d$-dimensional distributions $P^{(d)}, Q^{(d)} \in \mathcal{P}(\mathcal{X}^d)$, $d \in \mathbb{N}$.
- IID samples $(\mathbf{x}^n, \mathbf{y}^n) := (\{x^{(l)}\}_{l=1}^{n}, \{y^{(l)}\}_{l=1}^{n})$ drawn from $P^{(d)}$ and $Q^{(d)}$, respectively. Each pair of samples $x^{(l)}, y^{(l)} \in \mathcal{X}^d$.
Definition of Saliency
- Motivated by the asymptotics of binary hypothesis testing:
  $$ H_0: z^n \sim P^{(d)}, \qquad H_1: z^n \sim Q^{(d)}. $$
- Chernoff-Stein Lemma: if $\Pr(\hat{H}_1 \mid H_0) < \alpha$, then
  $$ \Pr(\hat{H}_0 \mid H_1) \doteq \exp\left(-n\, D(P^{(d)} \,\|\, Q^{(d)})\right). $$
- Chernoff Information:
  $$ \Pr(\mathrm{err}) = \pi_0 \Pr(\hat{H}_1 \mid H_0) + \pi_1 \Pr(\hat{H}_0 \mid H_1) \doteq \exp\left(-n\, C(P^{(d)}, Q^{(d)})\right), $$
  where
  $$ C(P, Q) := -\min_{t \in [0,1]} \log \sum_{z} P(z)^t\, Q(z)^{1-t}. $$
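To make the two exponents above concrete, here is a minimal numerical sketch (mine, not from the talk): it evaluates $D(P\,\|\,Q)$ and $C(P,Q)$ for a pair of toy distributions, with the minimization over $t$ done on a grid; the distributions and grid size are arbitrary illustrative choices.

```python
# Illustrative sketch: KL divergence and Chernoff information for two toy pmfs.
import numpy as np

def kl(p, q):
    """D(p || q) in nats, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def chernoff_information(p, q, grid=1001):
    """C(p, q) = -min_{t in [0,1]} log sum_z p(z)^t q(z)^(1-t), via a grid over t."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    ts = np.linspace(0.0, 1.0, grid)
    return -float(min(np.log(np.sum(p**t * q**(1.0 - t))) for t in ts))

P = np.array([0.6, 0.3, 0.1])      # toy P(z) on a 3-letter alphabet
Q = np.array([0.2, 0.3, 0.5])      # toy Q(z)
print(kl(P, Q))                    # exponent of Pr(decide H0 | H1 true) in Stein's lemma
print(chernoff_information(P, Q))  # exponent of the Bayesian error probability
```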
Definition of Saliency
- $P^{(d)}_A$ is the marginal of $P^{(d)}$ on the subset $A \subseteq \{1, \ldots, d\}$.
- Definition: a set $S_d \subseteq \{1, \ldots, d\}$ is KL-divergence-salient if
  $$ D(P^{(d)} \,\|\, Q^{(d)}) = D(P^{(d)}_{S_d} \,\|\, Q^{(d)}_{S_d}). $$
- Definition: a set $S_d \subseteq \{1, \ldots, d\}$ is Chernoff-information-salient if
  $$ C(P^{(d)}, Q^{(d)}) = C(P^{(d)}_{S_d}, Q^{(d)}_{S_d}). $$
Characterization of Saliency
Lemma. The following are equivalent:
- $S_d$ is KL-divergence-salient.
- $S_d$ is Chernoff-information-salient.
- The conditionals are identical:
  $$ P^{(d)} = P^{(d)}_{S_d}\, W_{S_d^c \mid S_d}, \qquad Q^{(d)} = Q^{(d)}_{S_d}\, W_{S_d^c \mid S_d}. $$
What are the scaling laws on $(n, d, k)$ so that the error probability can be made arbitrarily small?
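As a quick numerical sanity check of the last equivalence (a toy example of mine, not from the paper), build $P^{(d)} = P^{(d)}_{S_d} W_{S_d^c \mid S_d}$ and $Q^{(d)} = Q^{(d)}_{S_d} W_{S_d^c \mid S_d}$ with a common conditional $W$ for one non-salient binary variable and verify that the full divergence equals the divergence of the salient marginals:

```python
# Toy check: with a common conditional W for the non-salient variable,
# D(P || Q) on the full joint equals D(P_S || Q_S) on the salient marginal.
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

P_S = np.array([0.7, 0.3])     # salient marginal under P
Q_S = np.array([0.4, 0.6])     # salient marginal under Q
W = np.array([[0.9, 0.1],      # common conditional W_{S^c | S};
              [0.2, 0.8]])     # rows indexed by the salient variable

P_full = P_S[:, None] * W      # P(x_S, x_{S^c}) = P_S(x_S) W(x_{S^c} | x_S)
Q_full = Q_S[:, None] * W

print(kl(P_full, Q_full), kl(P_S, Q_S))   # the two values coincide
```

By the chain rule for KL divergence, using two different conditionals instead of a common $W$ would make the full divergence strictly larger than the marginal one, which is exactly what the lemma rules out for a salient set.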
Definition of Achievability
- A decoder is a set-valued function that maps samples to subsets of size $k$, i.e.,
  $$ \psi_n : (\mathcal{X}^d)^n \times (\mathcal{X}^d)^n \to \binom{\{1, \ldots, d\}}{k}. $$
- The decoder is given the true value of $k$, the number of salient features. (We are working on relaxing this assumption.)
- Definition: the tuple of model parameters $(n, d, k)$ is achievable for $\{P^{(d)}, Q^{(d)}\}_{d \in \mathbb{N}}$ if there exists a sequence of decoders $\{\psi_n\}$ such that
  $$ q_n(\psi_n) := \Pr\left(\psi_n(\mathbf{x}^n, \mathbf{y}^n) \neq S_d\right) < \epsilon \qquad \text{for all } n > N_\epsilon. $$
Three Assumptions on the Distributions $P^{(d)}, Q^{(d)}$
- Saliency: for every $P^{(d)}, Q^{(d)}$ there exists a salient set $S_d$ of known size $k$, i.e.,
  $$ D(P^{(d)} \,\|\, Q^{(d)}) = D(P^{(d)}_{S_d} \,\|\, Q^{(d)}_{S_d}). $$
- $\eta$-Distinguishability: there exists a constant $\eta > 0$ such that
  $$ D(P^{(d)}_{S_d} \,\|\, Q^{(d)}_{S_d}) - D(P^{(d)}_{T_d} \,\|\, Q^{(d)}_{T_d}) \ge \eta $$
  for all $T_d \neq S_d$ such that $|T_d| = k$.
- $L$-Boundedness: there exists a constant $L \in (0, \infty)$ such that
  $$ \log\left[\frac{P^{(d)}_{S_d}(z_{S_d})}{Q^{(d)}_{S_d}(z_{S_d})}\right] \in [-L, L] $$
  for all states $z_{S_d} \in \mathcal{X}^k$.
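When the joints are small enough to enumerate, the constants $\eta$ and $L$ can be computed directly. The sketch below is an illustration under assumptions of mine (joints stored as $d$-dimensional NumPy arrays; the helper names are hypothetical, not from the paper): it evaluates the worst-case divergence gap over competing $k$-subsets and the log-likelihood-ratio bound on the salient marginals.

```python
# Sketch: compute the eta-distinguishability gap and the L-boundedness constant
# for joints P, Q given as arrays of shape (|X|,)*d and a known salient set S.
import itertools
import numpy as np

def marginal(joint, subset, d):
    """Sum out every coordinate not in `subset`; kept axes stay in ascending order."""
    drop = tuple(i for i in range(d) if i not in subset)
    return joint.sum(axis=drop)

def kl(p, q):
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def eta_and_L(P, Q, S, k, d):
    P_S, Q_S = marginal(P, S, d), marginal(Q, S, d)
    base = kl(P_S, Q_S)
    gaps = [base - kl(marginal(P, T, d), marginal(Q, T, d))
            for T in itertools.combinations(range(d), k) if set(T) != set(S)]
    L = float(np.max(np.abs(np.log(P_S / Q_S))))   # assumes full support on X^k
    return min(gaps), L
```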
Achievability Result
Theorem. If there exists a $\delta > 0$ such that, for some $B > 0$,
$$ n > \max\left\{ B\, k \log\left(\frac{d-k}{k}\right),\; \exp\left(\frac{2k \log |\mathcal{X}|}{1-\delta}\right) \right\}, $$
then there exists a sequence of decoders $\psi_n$ that satisfies
$$ q_n(\psi_n) = O(\exp(-nE)) $$
for some exponent $E > 0$.
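Just to see how the two terms in the maximum behave, here is a purely illustrative evaluation; $B$, $\delta$, and the problem sizes below are placeholder values of mine, not constants from the paper.

```python
# Illustrative only: evaluate the two terms in
#   n > max{ B k log((d-k)/k), exp(2 k log|X| / (1 - delta)) }
# for placeholder values of the constants.
import numpy as np

d, k, alphabet_size = 10_000, 5, 2
B, delta = 1.0, 0.5                      # assumed constants, illustration only

term_subsets = B * k * np.log((d - k) / k)                             # grows like k log(d/k)
term_alphabet = np.exp(2 * k * np.log(alphabet_size) / (1.0 - delta))  # grows exponentially in k
print(term_subsets, term_alphabet)       # the second term dominates unless k is small
```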
Proof Idea and a Corollary
- Use the exhaustive-search decoder: search for the size-$k$ set with the largest empirical KL divergence (a minimal sketch follows after the corollary below).
- Large-deviation bounds, e.g., Sanov's theorem; positivity of the error exponents under the specified conditions.
- Similar to the main result in [Ng, UAI 1998], but we focus on exact subset recovery and not generalization error.

Corollary. Let $k = k_0$ be a constant and $R \in (0, B/k_0)$. If
$$ n > \frac{\log d}{R}, $$
then $q_n(\psi_n) = O(\exp(-nE))$.
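Below is a minimal sketch of the exhaustive-search decoder described in the first bullet; the dictionary-based empirical marginals and the smoothing constant are my own choices for a runnable illustration, not details taken from the paper.

```python
# Sketch of the exhaustive-search decoder: return the size-k subset whose
# empirical marginals under the two sample sets have the largest KL divergence.
import itertools
from collections import Counter
import numpy as np

def empirical_marginal(samples, subset):
    """Empirical pmf of the coordinates in `subset`; samples are length-d tuples."""
    counts = Counter(tuple(s[i] for i in subset) for s in samples)
    n = len(samples)
    return {z: c / n for z, c in counts.items()}

def empirical_kl(p_hat, q_hat, eps=1e-12):
    """Empirical D(p_hat || q_hat); eps only keeps the toy computation finite."""
    support = set(p_hat) | set(q_hat)
    return sum(p_hat.get(z, 0.0) * np.log((p_hat.get(z, 0.0) + eps) /
                                          (q_hat.get(z, 0.0) + eps))
               for z in support)

def exhaustive_search_decoder(x_samples, y_samples, d, k):
    best_set, best_val = None, -np.inf
    for T in itertools.combinations(range(d), k):
        val = empirical_kl(empirical_marginal(x_samples, T),
                           empirical_marginal(y_samples, T))
        if val > best_val:
            best_set, best_val = set(T), val
    return best_set
```

Each candidate subset only needs the empirical marginals on $k$ coordinates, but the $\binom{d}{k}$ outer loop is what the efficient tree-structured algorithm mentioned in the conclusions avoids.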
Converse Result
We assume that the salient set $S_d$ is chosen uniformly at random over all subsets of size $k$.

Theorem. If, for some $\lambda \in (0, 1)$,
$$ n < \frac{\lambda\, k \log\left(\frac{d}{k}\right)}{H(P^{(d)}) + H(Q^{(d)})}, $$
then $q_n(\psi_n) \ge 1 - \lambda$ for all decoders $\psi_n$.
Remarks
- The proof is a consequence of Fano's inequality (a sketch of this step follows below).
- The bound is never satisfied if the variables in $S_d^c$ are uniform and independent of those in $S_d$:
  $$ n < \frac{\lambda}{\underbrace{H(P^{(d)}) + H(Q^{(d)})}_{O(d)}}\; k \log\left(\frac{d}{k}\right). $$
  However, the converse is interesting if the distributions have additional structure on their entropies.
- Assume most of the features are processed or redundant. Example: there could be two features, BMI and isObese, where one is a processed version of the other.
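As a reading aid for the first remark, here is a hedged sketch of the Fano step at the level of this slide (not the paper's exact constants). With $S_d$ uniform over the $\binom{d}{k}$ candidate subsets, Fano's inequality gives, for any decoder $\psi_n$,
$$ q_n(\psi_n) \;\ge\; 1 - \frac{I(S_d;\, \mathbf{x}^n, \mathbf{y}^n) + \log 2}{\log \binom{d}{k}} \;\ge\; 1 - \frac{n\,[H(P^{(d)}) + H(Q^{(d)})] + \log 2}{k \log (d/k)}, $$
where the second step bounds the mutual information by the entropy of the samples, reading $H(P^{(d)})$ and $H(Q^{(d)})$ as the per-sample entropies, and uses $\log \binom{d}{k} \ge k \log(d/k)$. If $n$ is below the theorem's threshold, the right-hand side is at least $1 - \lambda - o(1)$.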
Converse Result
Corollary. Assume that there exists an $M < \infty$ such that the conditional entropies satisfy
$$ \max\left\{ H\!\left(P^{(d)}_{S_d^c \mid S_d}\right),\; H\!\left(Q^{(d)}_{S_d^c \mid S_d}\right) \right\} \le M k. $$
If the number of samples satisfies
$$ n < \frac{\lambda}{2(M + \log |\mathcal{X}|)} \log\left(\frac{d}{k}\right), $$
then $q_n(\psi_n) \ge 1 - \lambda$.
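A sketch of how the corollary follows from the theorem, assuming only the chain rule for entropy: the conditional-entropy assumption gives
$$ H(P^{(d)}) = H\big(P^{(d)}_{S_d}\big) + H\big(P^{(d)}_{S_d^c \mid S_d}\big) \le k \log |\mathcal{X}| + M k, $$
and similarly for $Q^{(d)}$, so $H(P^{(d)}) + H(Q^{(d)}) \le 2k(M + \log|\mathcal{X}|)$. Substituting into the theorem's threshold,
$$ \frac{\lambda\, k \log(d/k)}{H(P^{(d)}) + H(Q^{(d)})} \;\ge\; \frac{\lambda}{2(M + \log |\mathcal{X}|)} \log\left(\frac{d}{k}\right), $$
so any $n$ below the corollary's bound is also below the theorem's, and the conclusion $q_n(\psi_n) \ge 1 - \lambda$ carries over.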
Comparison to Achievability Result
- Assume $d = \exp(nR)$ and $k$ is constant.
- There is a rate $R_{\mathrm{ach}}$ so that if $R < R_{\mathrm{ach}}$, then $(n, d, k)$ is achievable. Conversely, there is another rate $R_{\mathrm{conv}} > R_{\mathrm{ach}}$ so that if $R > R_{\mathrm{conv}}$, then recovery is not possible.

[Figure: the rate axis starting at $0$, with $R_{\mathrm{ach}}$ marked to the left of $R_{\mathrm{conv}}$.]
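One plausible way to read the two corollaries in this rate form (a sketch under the corollaries as stated above, not an identification made on the slide): with $d = \exp(nR)$ and $k = k_0$ fixed, the achievability condition $n > (\log d)/R'$ for some $R' \in (0, B/k_0)$ reads $R < R'$, so recovery succeeds whenever $R < B/k_0$; the converse condition $n < \frac{\lambda}{2(M + \log|\mathcal{X}|)} \log(d/k)$ reads, up to the vanishing $(\log k_0)/n$ term, $R > \frac{2(M + \log|\mathcal{X}|)}{\lambda}$, so the error probability stays bounded away from zero whenever $R > 2(M + \log|\mathcal{X}|)$. Under this reading one could take $R_{\mathrm{ach}} = B/k_0$ and $R_{\mathrm{conv}} = 2(M + \log|\mathcal{X}|)$.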
Conclusions
- Provided an information-theoretic definition of saliency motivated by error exponents in hypothesis testing.
- Provided necessary and sufficient conditions for salient set recovery; the number of samples $n$ can be much smaller than $d$, the total number of variables.
- In the paper, we provide a computationally efficient and consistent algorithm to search for $S_d$ when $P^{(d)}$ and $Q^{(d)}$ are Markov on trees.