Nonparametric Bayes tensor factorizations for big data
1 Nonparametric Bayes tensor factorizations for big data
David Dunson, Department of Statistical Science, Duke University
Funded from NIH R01-ES017240, R01-ES & DARPA N C-2082
2 Outline
- Motivation
- Conditional tensor factorizations
- Some properties, heuristic & otherwise
- Computation & applications
- Generalizations
3 Motivating setting - high-dimensional predictors
Routine to encounter massive-dimensional prediction & variable selection problems.
We have $y \in \mathcal{Y}$ & $x = (x_1, \ldots, x_p) \in \mathcal{X}$.
Unreasonable to assume linearity or additivity in motivating applications - e.g., epidemiology, genomics, neurosciences.
Goal: nonparametric approaches that accommodate large p, small n, allow interactions, & scale computationally to big p.
4 Gaussian processes with variable selection
For $\mathcal{Y} = \mathbb{R}$ & $\mathcal{X} \subset \mathbb{R}^p$, one approach lets $y_i = \mu(x_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$, where $\mu : \mathcal{X} \to \mathbb{R}$ is an unknown regression function.
Following Zou et al. (2010) & others,
$\mu \sim GP(m, c)$, $c(x, x') = \phi \exp\{-\sum_{j=1}^p \alpha_j (x_j - x'_j)^2\}$,
with mixture priors placed on the $\alpha_j$'s.
Zou et al. (2010) show good empirical results; Bhattacharya, Pati & Dunson (2011) - minimax adaptive rates.
5 Issues & alternatives
Mean regression & computation challenging - difficult computationally beyond the conditionally Gaussian homoscedastic case.
Density regression interesting, as the variance & shape of the response distribution often change with x.
Initial focus: classification from many categorical predictors.
Approach generalizes directly to arbitrary $\mathcal{Y}$ and $\mathcal{X}$.
6 Classification & conditional probability tensors
Suppose $Y \in \{1, \ldots, d_0\}$ & $X_j \in \{1, \ldots, d_j\}$, $j = 1, \ldots, p$.
The classification function, or conditional probability, is $\Pr(Y = y \mid X_1 = x_1, \ldots, X_p = x_p) = P(y \mid x_1, \ldots, x_p)$.
This classification function can be structured as a $d_0 \times d_1 \times \cdots \times d_p$ tensor.
Let $\mathcal{P}_{d_1, \ldots, d_p}(d_0)$ denote the set of all possible conditional probability tensors.
$P \in \mathcal{P}_{d_1, \ldots, d_p}(d_0)$ implies $P(y \mid x_1, \ldots, x_p) \ge 0$ for all $y, x_1, \ldots, x_p$ & $\sum_{y=1}^{d_0} P(y \mid x_1, \ldots, x_p) = 1$.
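As a concrete illustration of the object above, a conditional probability tensor can be stored as a $(d_0, d_1, \ldots, d_p)$ array and checked for membership in $\mathcal{P}_{d_1, \ldots, d_p}(d_0)$ (a minimal sketch; the sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: binary response, p = 3 categorical predictors.
d0, dims = 2, (3, 2, 4)          # d0 response classes, d_j levels per predictor

# A random valid conditional probability tensor P[y, x1, ..., xp]:
# draw positive entries, then normalize over the response index y.
P = rng.gamma(1.0, size=(d0,) + dims)
P /= P.sum(axis=0, keepdims=True)

# Membership in the set of conditional probability tensors means
# nonnegativity and sum-to-one over y for every predictor combination.
assert np.all(P >= 0)
assert np.allclose(P.sum(axis=0), 1.0)

# Conditional probability lookup: Pr(Y = y | X1 = x1, X2 = x2, X3 = x3).
y, x = 1, (2, 0, 3)
prob = P[(y,) + x]
```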
7 Tensor factorizations
P = big tensor & data will be very sparse.
If P were a matrix, we might think of the SVD; we can instead consider a tensor factorization.
A common approach is PARAFAC - a sum of rank-one tensors.
Tucker factorizations express a $d_1 \times \cdots \times d_p$ tensor $A = \{a_{c_1 \cdots c_p}\}$ as
$a_{c_1 \cdots c_p} = \sum_{h_1=1}^{d_1} \cdots \sum_{h_p=1}^{d_p} g_{h_1 \cdots h_p} \prod_{j=1}^{p} u^{(j)}_{h_j c_j}$,
where $G = \{g_{h_1 \cdots h_p}\}$ is a core tensor.
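The Tucker sum above can be sketched numerically with `np.einsum` (a small p = 3 illustration with arbitrary dimensions, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Tucker factorization of a d1 x d2 x d3 tensor (p = 3 for illustration):
# a_{c1 c2 c3} = sum_{h1 h2 h3} g_{h1 h2 h3} u1[h1, c1] u2[h2, c2] u3[h3, c3].
d, k = (4, 3, 5), (2, 2, 3)                       # tensor dims, core dims
G = rng.standard_normal(k)                        # core tensor
U = [rng.standard_normal((k[j], d[j])) for j in range(3)]

# Reconstruct the full tensor from the factorization with einsum.
A = np.einsum('abc,ai,bj,ck->ijk', G, U[0], U[1], U[2])

# Sanity check a single entry against the explicit triple sum.
c = (1, 2, 4)
a_c = sum(G[h1, h2, h3] * U[0][h1, c[0]] * U[1][h2, c[1]] * U[2][h3, c[2]]
          for h1 in range(k[0]) for h2 in range(k[1]) for h3 in range(k[2]))
assert np.isclose(A[c], a_c)
```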
8 Our factorization (with Yun Yang)
Our proposed nonparametric model for the conditional probability is a Tucker factorization of P:
$P(y \mid x_1, \ldots, x_p) = \sum_{h_1=1}^{k_1} \cdots \sum_{h_p=1}^{k_p} \lambda_{h_1 h_2 \cdots h_p}(y) \prod_{j=1}^{p} \pi^{(j)}_{h_j}(x_j)$.  (1)
To be a valid conditional probability, the parameters are subject to
$\sum_{c=1}^{d_0} \lambda_{h_1 h_2 \cdots h_p}(c) = 1$ for any $(h_1, h_2, \ldots, h_p)$,
$\sum_{h=1}^{k_j} \pi^{(j)}_h(x_j) = 1$ for any possible pair $(j, x_j)$.  (2)
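A minimal numerical sketch of (1)-(2) (with p = 2, made-up dimensions): when the core slices and the $\pi^{(j)}$ columns are probability vectors, the factorization automatically yields a valid conditional probability at every x.

```python
import numpy as np

rng = np.random.default_rng(2)

d0, d, k = 2, (3, 3), (2, 3)     # response classes, predictor levels, ranks k_j

# Core tensor: lambda[h1, h2, :] is a probability vector over y (constraint 2a).
lam = rng.dirichlet(np.ones(d0), size=k)                   # shape (k1, k2, d0)

# pi[j][:, x_j] is a probability vector over h_j for each level x_j (2b).
pi = [rng.dirichlet(np.ones(k[j]), size=d[j]).T for j in range(2)]

def cond_prob(x):
    """P(y | x1, x2) via the Tucker-type factorization (1)."""
    w = np.einsum('a,b->ab', pi[0][:, x[0]], pi[1][:, x[1]])  # product weights
    return np.einsum('ab,aby->y', w, lam)

# Constraints (2) guarantee a valid conditional probability for every x.
for x1 in range(d[0]):
    for x2 in range(d[1]):
        p = cond_prob((x1, x2))
        assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
```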
9 Comments on proposed factorization
$k_j = 1$ corresponds to exclusion of the jth feature.
By placing a prior on $k_j$, can induce variable selection & learning of the dimension of the factorization.
The representation is many-to-one and the parameters in the factorization cannot be uniquely identified.
This does not present a barrier to Bayesian inference - we don't care about the parameters in the factorization.
We want to do variable selection, prediction & inferences on predictor effects.
10 Theoretical support
The following theorem formalizes the flexibility:
Theorem. Every $d_0 \times d_1 \times d_2 \times \cdots \times d_p$ conditional probability tensor $P \in \mathcal{P}_{d_1, \ldots, d_p}(d_0)$ can be decomposed as (1), with $1 \le k_j \le d_j$ for $j = 1, \ldots, p$. Furthermore, $\lambda_{h_1 h_2 \cdots h_p}(y)$ and $\pi^{(j)}_{h_j}(x_j)$ can be chosen to be nonnegative and satisfy the constraints (2).
11 Latent variable representation
Simplify the representation by introducing p latent class indicators $z_1, \ldots, z_p$ for $X_1, \ldots, X_p$.
Conditional independence of Y and $(X_1, \ldots, X_p)$ given $(z_1, \ldots, z_p)$.
The model can be written as
$Y_i \mid z_{i1}, \ldots, z_{ip} \sim \text{Mult}(\{1, \ldots, d_0\}, \lambda_{z_{i1}, \ldots, z_{ip}})$,
$z_{ij} \mid X_{ij} = x_j \sim \text{Mult}(\{1, \ldots, k_j\}, (\pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)))$.
Useful computationally & provides some insight into the model.
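The generative scheme above can be sketched directly (a toy p = 2 illustration with arbitrary parameter draws; note Y depends on X only through the latent classes z):

```python
import numpy as np

rng = np.random.default_rng(3)

d0, d, k = 2, (3, 3), (2, 2)

lam = rng.dirichlet(np.ones(d0), size=k)                  # (k1, k2, d0) core
pi = [rng.dirichlet(np.ones(k[j]), size=d[j]) for j in range(2)]  # (d_j, k_j)

def draw_y(x):
    """Generate Y via the latent class indicators z_1, ..., z_p."""
    # z_j | X_j = x_j  ~  Mult({1..k_j}, pi^{(j)}(x_j))
    z = tuple(rng.choice(k[j], p=pi[j][x[j]]) for j in range(2))
    # Y | z  ~  Mult({1..d_0}, lambda_z): Y independent of X given z.
    return rng.choice(d0, p=lam[z])

ys = [draw_y((0, 1)) for _ in range(1000)]
```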
12 Prior specification & hierarchical model
Conditional likelihood of the response: $(Y_i \mid z_{i1}, \ldots, z_{ip}, \Lambda) \sim \text{Multinomial}(\{1, \ldots, d_0\}, \lambda_{z_{i1}, \ldots, z_{ip}})$.
Conditional likelihood of the latent class variables: $(z_{ij} \mid X_{ij} = x_j, \pi) \sim \text{Multinomial}(\{1, \ldots, k_j\}, (\pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)))$.
Prior on the core tensor: $\lambda_{h_1, \ldots, h_p} = (\lambda_{h_1, \ldots, h_p}(1), \ldots, \lambda_{h_1, \ldots, h_p}(d_0)) \sim \text{Diri}(1/d_0, \ldots, 1/d_0)$.
Prior on the independent rank-one components: $(\pi^{(j)}_1(x_j), \ldots, \pi^{(j)}_{k_j}(x_j)) \sim \text{Diri}(1/k_j, \ldots, 1/k_j)$.
13 Prior on predictor inclusion/tensor rank
For the jth dimension, we choose the simple prior
$P(k_j = 1) = 1 - r/p$, $P(k_j = k) = r/\{(d_j - 1)p\}$, $k = 2, \ldots, d_j$,
where $d_j$ = # levels of covariate $X_j$, r = expected # important features, $\bar{r}$ = specified maximum number of features.
The effective prior on the $k_j$'s is
$P(k_1 = l_1, \ldots, k_p = l_p) = P(k_1 = l_1) \cdots P(k_p = l_p) \, I_{\{|\{j : l_j > 1\}| \le \bar{r}\}}(l_1, \ldots, l_p)$,
where $I_A(\cdot)$ is the indicator function for set A.
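The marginal prior on $k_j$ above is simple enough to tabulate; a quick sketch (the numbers p = 600, r = 5, $d_j$ = 4 are illustrative, echoing the simulation setting later in the talk) shows how sharply it concentrates on exclusion ($k_j = 1$):

```python
import numpy as np

def prior_kj(d_j, r, p):
    """Marginal prior on k_j: P(k_j = 1) = 1 - r/p,
    P(k_j = k) = r / ((d_j - 1) p) for k = 2, ..., d_j."""
    probs = np.full(d_j, r / ((d_j - 1) * p))
    probs[0] = 1.0 - r / p
    return probs                      # index 0 holds P(k_j = 1)

# Example: p = 600 candidate predictors, r = 5 expected important ones,
# d_j = 4 levels. Mass concentrates heavily on k_j = 1 (exclusion).
probs = prior_kj(d_j=4, r=5, p=600)
assert np.isclose(probs.sum(), 1.0)
assert probs[0] > 0.99
```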
14 Properties - bias-variance tradeoff
Extreme data sparsity - the vast majority of combinations of $Y, X_1, \ldots, X_p$ are not observed.
Critical to include sparsity assumptions - even if such assumptions do not hold, they massively reduce the variance.
Discard predictors having small impact & parameters having small values.
Makes the problem tractable & may lead to good MSE.
15 Illustrative example
Binary Y & p binary covariates $X_j \in \{-1, 1\}$, $j = 1, \ldots, p$.
The true model can be expressed in the form [$\beta \in (0, 1)$]
$P(Y = 1 \mid X_1 = x_1, \ldots, X_p = x_p) = \frac{1}{2} + \frac{\beta}{2^2} x_1 + \cdots + \frac{\beta}{2^{p+1}} x_p$.
The effect of $X_j$ decreases exponentially as j increases from 1 to p.
Natural strategy: estimate $P(Y = 1 \mid X_1 = x_1, \ldots, X_p = x_p)$ by sample frequencies over the first k covariates,
$|\{i : y_i = 1, x_{1i} = x_1, \ldots, x_{ki} = x_k\}| \, / \, |\{i : x_{1i} = x_1, \ldots, x_{ki} = x_k\}|$,
& ignore the remaining p − k covariates.
Suppose we have $n = 2^l$ ($k \le l \le p$) observations with one in each cell of combinations of $X_1, \ldots, X_l$.
16 MSE analysis
The mean square error (MSE) can be expressed as
$\text{MSE} = \sum_{h_1, \ldots, h_p} E\{P(Y = 1 \mid X_1 = h_1, \ldots, X_p = h_p) - \hat{P}(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k)\}^2 = \text{Bias}^2 + \text{Var}$.
The squared bias is
$\text{Bias}^2 = \sum_{h_1, \ldots, h_p} \{P(Y = 1 \mid X_1 = h_1, \ldots, X_p = h_p) - E\hat{P}(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k)\}^2 = \frac{\beta^2}{3}(2^{p-2k-2} - 2^{-p-2})$.
17 MSE analysis (continued)
Finally we obtain the variance as
$\text{Var} = \sum_{h_1, \ldots, h_p} \text{Var}\,\hat{P}(Y = 1 \mid X_1 = h_1, \ldots, X_k = h_k)$
$= 2^{p-k+1} \, 2^{k-l} \sum_{i=1}^{2^{k-1}} \left(\frac{1}{2} + \frac{2i-1}{2^{k+1}}\beta\right)\left(\frac{1}{2} - \frac{2i-1}{2^{k+1}}\beta\right)$
$= \frac{1}{3}\{(3 - \beta^2)2^{p+k-l-2} + \beta^2 2^{p-k-l-2}\}$.
Since there are $2^p$ cells, the average MSE for each cell equals
$\frac{1}{3}\{(3 - \beta^2)2^{k-l-2} + \beta^2 2^{-k-l-2} + \beta^2 2^{-2k-2} - \beta^2 2^{-2p-2}\}$.
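The closed-form squared bias in this analysis can be checked by brute-force enumeration of the $2^p$ cells for small p and k (a verification sketch; p = 6, k = 2, $\beta$ = 0.7 are arbitrary choices):

```python
import itertools
import numpy as np

# Check the closed-form squared bias by enumerating all cells.
# True model: P(Y=1|x) = 1/2 + sum_i beta x_i / 2^{i+1}, x_i in {-1, 1};
# the estimator conditions on the first k covariates, so its expectation
# (with balanced cells) is 1/2 + sum_{i<=k} beta x_i / 2^{i+1}.
p, k, beta = 6, 2, 0.7

bias2 = 0.0
for x in itertools.product([-1, 1], repeat=p):
    truth = 0.5 + sum(beta * x[i] / 2 ** (i + 2) for i in range(p))
    expect = 0.5 + sum(beta * x[i] / 2 ** (i + 2) for i in range(k))
    bias2 += (truth - expect) ** 2

closed_form = beta ** 2 / 3 * (2 ** (p - 2 * k - 2) - 2 ** (-p - 2))
assert np.isclose(bias2, closed_form)
```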
18 Implications of MSE analysis
# predictors p has little impact on the selection of k.
$k \le l$, so the second term is small compared to the 1st & 3rd terms.
The average MSE attains its minimum at $k \approx l/3 = \log_2(n)/3$.
The true model is not sparse & all the predictors impact the conditional probability,
but the optimal # predictors only depends on the log sample size.
19 Borrowing of information
A critical feature of our model is borrowing across cells.
Letting $w_{h_1, \ldots, h_p}(x_1, \ldots, x_p) = \prod_j \pi^{(j)}_{h_j}(x_j)$, our model is
$P(Y = y \mid X_1 = x_1, \ldots, X_p = x_p) = \sum_{h_1, \ldots, h_p} w_{h_1, \ldots, h_p}(x_1, \ldots, x_p) \lambda_{h_1 \cdots h_p}(y)$,
with $\sum_{h_1, \ldots, h_p} w_{h_1, \ldots, h_p}(x_1, \ldots, x_p) = 1$.
View $\lambda_{h_1 \cdots h_p}(y)$ as the frequency of $Y = y$ in cell $X_1 = h_1, \ldots, X_p = h_p$.
We have a kernel estimate, borrowing information via a weighted average of cell frequencies.
20 Illustrative example
One covariate $X \in \{1, \ldots, m\}$ with $Y \in \{0, 1\}$ & $P_j = P(Y = 1 \mid X = j)$.
Naive estimate: $\hat{P}_j = k_j / n_j = |\{i : y_i = 1, x_i = j\}| / |\{i : x_i = j\}|$ = sample frequencies.
Alternatively, consider a kernel estimate indexed by $0 \le c \le 1/(m-1)$:
$\tilde{P}_j = \{1 - (m-1)c\}\hat{P}_j + c \sum_{k \ne j} \hat{P}_k$, $j = 1, \ldots, m$.
Use squared error loss to compare these estimators.
21 MSE for illustrative example
$E\{L(\hat{P}, P)\} = \sum_{j=1}^m E(\hat{P}_j - P_j)^2 = \sum_{j=1}^m \frac{P_j(1 - P_j)}{n_j}$.
$E\{L(\tilde{P}, P)\} = \sum_{j=1}^m E(\tilde{P}_j - P_j)^2$ is a function of c with minimum at
$c_0 = \frac{\frac{1}{m} E\{L(\hat{P}, P)\}}{E\{L(\hat{P}, P)\} + \frac{1}{m-1}\sum_{i<j}(P_i - P_j)^2} \in \left(0, \frac{1}{m-1}\right)$.
When the $P_j$'s are similar, the estimate $\tilde{P}$ can reduce the risk to as little as 1/m of the risk of estimating each $\hat{P}_j$ separately.
If the $P_j$'s are not similar, $\tilde{P}$ can still reduce the risk considerably when the cell counts $\{n_j\}$ are small.
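The exact risk of the shrinkage estimator can be computed in closed form from the bias-variance decomposition, and the numerically optimal c can be compared against $c_0$ (a sketch with made-up values of $P_j$ and $n_j$; it assumes independent binomial cells, as in the slide):

```python
import numpy as np

# Exact risk of P~_j = {1-(m-1)c} P^_j + c sum_{k!=j} P^_k as a function
# of c, using E P^_j = P_j and Var P^_j = P_j (1 - P_j) / n_j.
P = np.array([0.30, 0.32, 0.34, 0.36])       # similar cell probabilities
n = np.array([5, 5, 5, 5])                   # small cell counts
m = len(P)
var_hat = P * (1 - P) / n

def risk(c):
    total = 0.0
    for j in range(m):
        mean_j = (1 - (m - 1) * c) * P[j] + c * (P.sum() - P[j])
        var_j = ((1 - (m - 1) * c) ** 2 * var_hat[j]
                 + c ** 2 * (var_hat.sum() - var_hat[j]))
        total += (mean_j - P[j]) ** 2 + var_j
    return total

grid = np.linspace(0.0, 1.0 / (m - 1), 2001)
risks = np.array([risk(c) for c in grid])
c_opt = grid[risks.argmin()]

# Compare the grid optimum with the closed-form c_0 from the slide.
pair = sum((P[i] - P[j]) ** 2 for i in range(m) for j in range(i + 1, m))
c0 = (var_hat.sum() / m) / (var_hat.sum() + pair / (m - 1))

assert 0.0 < c_opt < 1.0 / (m - 1)     # optimum strictly inside the interval
assert risks.min() < risk(0.0)         # borrowing beats the naive estimator
assert abs(c_opt - c0) < 1e-3
```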
22 Setting & assumptions
Data $y^n$ & $X^n$ for n subjects with $p_n \gg n$ (large p, small n).
Assume $d_j = d$ for simplicity in exposition.
Putting the true model $P_0$ in our tensor form, assume
Assumption A. $\sum_{j=1}^{p_n} \max_{x_j} \sum_{h_j=2}^{d} \pi^{(j)}_{h_j}(x_j) < \infty$.
This is a near-sparsity restriction on $P_0$. Additionally assume the true conditional probabilities are strictly greater than zero:
Assumption B. $P_0(y \mid x) \ge \epsilon_0$ for any x, y, for some $\epsilon_0 > 0$.
23 Posterior convergence theorem
$x_1, \ldots, x_n$ independent from unknown $G_n$ on $\{1, \ldots, d\}^{p_n}$.
Let $\epsilon_n$ be a sequence with $\epsilon_n \to 0$, $n\epsilon_n^2 \to \infty$ and $\sum_n \exp(-n\epsilon_n^2) < \infty$. Assume the following conditions hold:
(i) $\bar{r}_n \log p_n \le n\epsilon_n^2$,
(ii) $\bar{r}_n d^{\bar{r}_n} \log(\bar{r}_n / \epsilon_n) \le n\epsilon_n^2$,
(iii) $\bar{r}_n / p_n \to 0$ as $n \to \infty$, and
(iv) there exists a sequence of models $\gamma_n$ with size $r_n$ such that $\sum_{j \notin \gamma_n} \max_{x_j} \sum_{h_j=2}^{d} \pi^{(j)}_{h_j}(x_j) \le \epsilon_n^2$.
Denote $d(P, P_0) = \int \sum_{y=1}^{d_0} |P(y \mid x_1, \ldots, x_p) - P_0(y \mid x_1, \ldots, x_p)| \, G_n(dx_1, \ldots, dx_p)$; then
$\Pi_n\{P : d(P, P_0) \ge M\epsilon_n \mid y^n, X^n\} \to 0$ a.s. $P_0^n$.
24 Implications of theorem
The posterior convergence rate can be very close to $n^{-1/2}$ for appropriate hyperparameter choices.
For any $\alpha \in (0, 1)$, $\epsilon_n = n^{-(1-\alpha)/2} \log n$ satisfies the conditions:
$r_n \le \log n$ (# important predictors scales with log n),
$p_n \le \exp(n^\alpha)$ (# candidate predictors exponential in n),
and there exists a sequence of models $\gamma_n$ with size $r_n$ such that
$\sum_{j \notin \gamma_n} \max_{x_j} \sum_{h_j=2}^{d} \pi^{(j)}_{h_j}(x_j) \le n^{\alpha - 1} \log^2 n$.
Use $r_n = \log_d(n)$, $\bar{r}_n = 2r_n$ as default values for the prior in applications.
25 Posterior computation
Conditionally on $\{k_j\}$, a simple Gibbs sampler - Dirichlet & multinomial conditionals.
# components to update is $\prod_{j=1}^p k_j$ - can blow up with p, but all but a small number of $k_j = 1$.
To update $\{k_j\}$ we could use RJMCMC, but this doesn't scale well computationally.
Two-stage algorithm: (i) SSVS to estimate $k_j$ - acceptance probabilities use approximated conditional marginal likelihoods; (ii) conditionally on $\{\hat{k}_j\}$, run Gibbs.
Scales efficiently & excellent performance in the cases we have considered.
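Conditionally on $\{k_j\}$, one Gibbs sweep only needs multinomial and Dirichlet draws. A minimal sketch of stage (ii) for p = 2 and fixed $k_1 = k_2 = 2$ on random toy data (the data and sizes are made up; the real implementation would only touch dimensions with $k_j > 1$):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: n observations, p = 2 categorical predictors, binary response.
n, d0, d, k = 200, 2, (3, 3), (2, 2)
X = rng.integers(0, 3, size=(n, 2))
y = rng.integers(0, 2, size=n)

# Initialize parameters and latent classes.
lam = np.full(k + (d0,), 1.0 / d0)                       # core tensor
pi = [np.full((k[j], d[j]), 1.0 / k[j]) for j in range(2)]
z = np.zeros((n, 2), dtype=int)

for sweep in range(100):
    # 1) Latent classes: P(z_ij = h | -) propto pi_h^{(j)}(x_ij) lam_z(y_i).
    for j in range(2):
        for i in range(n):
            zo = z[i, 1 - j]
            lam_slice = lam[:, zo, y[i]] if j == 0 else lam[zo, :, y[i]]
            w = pi[j][:, X[i, j]] * lam_slice
            z[i, j] = rng.choice(k[j], p=w / w.sum())
    # 2) Core tensor: Dirichlet posterior from response counts per (h1, h2).
    for h1 in range(k[0]):
        for h2 in range(k[1]):
            mask = (z[:, 0] == h1) & (z[:, 1] == h2)
            counts = np.bincount(y[mask], minlength=d0)
            lam[h1, h2] = rng.dirichlet(1.0 / d0 + counts)
    # 3) Rank-one components: Dirichlet posterior from latent-class counts.
    for j in range(2):
        for x in range(d[j]):
            mask = X[:, j] == x
            counts = np.bincount(z[mask, j], minlength=k[j])
            pi[j][:, x] = rng.dirichlet(1.0 / k[j] + counts)
```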
26 Simulation study
N = 2,000 instances, p = 600 covariates $X_j \in \{1, \ldots, 4\}$ & binary response Y.
True model: 3 important predictors, $X_9$, $X_{11}$ and $X_{13}$.
Generate $P(Y = 1 \mid X_9 = x_9, X_{11} = x_{11}, X_{13} = x_{13})$ independently for each combination of $(x_9, x_{11}, x_{13})$.
To obtain an optimal misclassification rate of 15%, probabilities generated as $f(U) = U^2 / \{U^2 + (1 - U)^2\}$, $U \sim \text{Unif}(0, 1)$.
n training samples & N − n testing.
Training: $n \in \{200, 400, 600, 800\}$, with 10 random training-test splits. Apply our approach to each split.
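A quick quadrature confirms that generating cell probabilities as $f(U)$ gives a Bayes (optimal) misclassification rate of roughly 15%, since that rate is $E[\min\{f(U), 1 - f(U)\}]$ (a verification sketch, not part of the original slides):

```python
import numpy as np

# Optimal misclassification rate = E[min(f(U), 1 - f(U))] with
# f(U) = U^2 / (U^2 + (1-U)^2), U ~ Unif(0, 1); trapezoid rule on a grid.
u = np.linspace(0.0, 1.0, 100001)
f = u ** 2 / (u ** 2 + (1 - u) ** 2)
g = np.minimum(f, 1 - f)
bayes_rate = np.sum((g[1:] + g[:-1]) / 2 * np.diff(u))
assert 0.14 < bayes_rate < 0.16     # roughly the 15% quoted in the talk
```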
27 Simulation results - misclassification rate
Table: Testing results for the synthetic data example. RF: random forests; TF: our tensor factorization model. Columns: training size, aMSE of TF, misclassification rate of TF, misclassification rate of RF. [Table values lost in transcription.]
$\text{aMSE} = \frac{1}{4^p} \sum_{x_1, \ldots, x_p} \{P(Y = 1 \mid x_1, \ldots, x_p) - \hat{P}(Y = 1 \mid x_1, \ldots, x_p)\}^2$.
28 Simulation results - variable selection performance
Table: Columns 2-4 = inclusion probabilities of the 9th, 11th, 13th predictors. Column 5 = maximum inclusion probability across the remaining predictors. Column 6 = average inclusion probability across the remaining predictors. Quantities are averages over 10 trials. [Table values lost in transcription.]
29 Application data sets
1. Promoter gene sequences: A, C, G, T nucleotides at p = 57 positions for N = 106 sequences & binary response (promoter or not). 5-fold CV: n = 85 training & N − n = 21 test samples in each split.
2. Splice-junction gene sequences: A, C, G, T nucleotides at p = 60 positions for N = 3,175 sequences. Response classes: exon/intron boundary (EI), intron/exon boundary (IE) or neither (N). Considered $n \in \{200, 2540\}$.
3. Single Proton Emission Computed Tomography (SPECT): cardiac patients, normal/abnormal. 267 SPECT images & 22 binary features. Previously divided: n = 80 & N − n = 187.
30 Results
Table: RF: random forests, NN: neural networks, SVM: support vector machine, BART: Bayesian additive regression trees, TF: our tensor factorization model. Misclassification rates displayed for CART, RF, NN, LASSO, SVM, BART & TF on Promoter (n=85), Splice (n=200), Splice (n=2540) & SPECT (n=80). [Table values lost in transcription.]
At worst, classification performance comparable with RF, the best of the competitors.
Particularly good relative performance as n decreases & p increases.
31 Variable selection - interpretability & parsimony
Additional advantages in terms of variable selection.
In the promoter data, selected nucleotides at the 15th, 16th, 17th, and 39th positions.
In the splice data, the 28th, 29th, 30th, 31st, 32nd and 35th positions are selected.
In the SPECT data, the 11th, 13th and 16th predictors are selected.
In each case, excellent classification performance is obtained based on a small subset of the predictors.
32 Generalization - conditional distribution modeling
Generalization: conditional distribution estimation,
$f(y \mid x) = \sum_{h=1}^{k} \sum_{h_1=1}^{k_1} \cdots \sum_{h_p=1}^{k_p} \pi_{h h_1 \cdots h_p}(x) K(y; \theta_{h h_1 \cdots h_p})$,
with $\{\pi_{h h_1 \cdots h_p}(x)\}$ = a core probability tensor of predictor-dependent weights on a multiway array of kernels.
Motivated by the above conditional tensor factorization for classification, let
$\pi_{h h_1 \cdots h_p}(x) = \pi_h \prod_{j=1}^{p} \pi^{(j)}_{h_j}(x_j)$.
33 Linear Tucker density regression
Letting $x \in \mathcal{X} = [0, 1]^p$ and $\psi_j \in [0, 1]$, choose
$\pi^{(j)}_1(x_j) = 1 - x_j \psi_j$, $\pi^{(j)}_2(x_j) = x_j \psi_j$, $k_j = 2$, $j = 1, \ldots, p$.
The model linearly interpolates but otherwise is extremely flexible.
In the simple case in which p = 1 & the kernel is Gaussian, we have
$f(y \mid x) = \sum_{h=1}^{k} \pi_h \{(1 - x\psi) N(y; \mu_{h1}, \tau_{h1}^{-1}) + x\psi N(y; \mu_{h2}, \tau_{h2}^{-1})\}$,
which induces the linear mean regression model
$E(y \mid x) = \{\sum_{h=1}^{k} \pi_h \mu_{h1}\} + \{\sum_{h=1}^{k} \pi_h \psi(\mu_{h2} - \mu_{h1})\} x = \beta_0 + \beta_1 x$.
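The induced linearity of the mean can be checked numerically: for any mixture weights and component means, $E(y \mid x)$ computed from the mixture agrees with $\beta_0 + \beta_1 x$ (a sketch with arbitrary parameter draws; k = 4 and $\psi$ = 0.8 are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# p = 1 linear Tucker density regression: a mixture whose x-weights
# interpolate linearly, so the induced mean E(y|x) is affine in x.
kk, psi = 4, 0.8
pi_h = rng.dirichlet(np.ones(kk))
mu = rng.standard_normal((kk, 2))          # mu[h, 0] = mu_h1, mu[h, 1] = mu_h2

def mean_y(x):
    """E(y|x) = sum_h pi_h {(1 - x psi) mu_h1 + x psi mu_h2}."""
    return float(np.sum(pi_h * ((1 - x * psi) * mu[:, 0] + x * psi * mu[:, 1])))

beta0 = float(np.sum(pi_h * mu[:, 0]))
beta1 = float(np.sum(pi_h * psi * (mu[:, 1] - mu[:, 0])))

for x in (0.0, 0.25, 0.5, 1.0):
    assert np.isclose(mean_y(x), beta0 + beta1 * x)
```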
34 Linear Tucker density regression - comments
Different quantiles of $f(y \mid x)$ change linearly with x, but with slopes that vary.
If k = 1 and $\tau_{h1} = \tau_{h2} = \tau_h$, obtain simple normal linear regression.
Density changes linearly, from $f(y \mid x = 0) = \sum_h \pi_h N(y; \mu_{h1}, \tau_{h1}^{-1})$ to $f(y \mid x = 1) = \sum_h \pi_h N(y; \mu_{h2}, \tau_{h2}^{-1})$ as x increases.
As p increases, still interpolate linearly but accommodate interactions.
Posterior computation: single-stage Gibbs sampler (multinomial, Dirichlet, normal-gamma, Bernoulli, beta).
35 Joint tensor factorizations
Focused in this talk on conditional Tucker factorizations.
Can also use probabilistic tensor factorizations for joint modeling - very useful for huge sparse contingency table analysis.
The same ideas provide a type of multivariate generalization of current Bayes discrete mixtures:
instead of a single cluster index, multiple dependent cluster indices underlying each type of data.
36 References
Banerjee, A., Murray, J. & Dunson, D.B. (2012). Nonparametric Bayes infinite tensor factorization priors.
Bhattacharya, A. & Dunson, D.B. (2012). Simplex factor models for multivariate unordered categorical data. JASA, 107.
Bhattacharya, A. & Dunson, D.B. (2012). Nonparametric Bayes testing of associations in high-dimensional categorical data. Almost done!
Dunson, D.B. & Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. JASA, 104.
Ghosh, A. & Dunson, D.B. (2012). Conditional distribution from high-dimensional interacting predictors. In progress.
Yang, Y. & Dunson, D.B. (2012). Bayesian conditional tensor factorizations for high-dimensional classification. arXiv.
Bayesian Additive Regression Tree (BART) with application to controlled trail data analysis Weilan Yang wyang@stat.wisc.edu May. 2015 1 / 20 Background CATE i = E(Y i (Z 1 ) Y i (Z 0 ) X i ) 2 / 20 Background
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationWhy experimenters should not randomize, and what they should do instead
Why experimenters should not randomize, and what they should do instead Maximilian Kasy Department of Economics, Harvard University Maximilian Kasy (Harvard) Experimental design 1 / 42 project STAR Introduction
More informationBayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine
Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationMarkov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationStudy Notes on the Latent Dirichlet Allocation
Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection
More informationProbability and Information Theory. Sargur N. Srihari
Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal
More informationNonparametric Bayes regression and classification through mixtures of product kernels
Nonparametric Bayes regression and classification through mixtures of product kernels David B. Dunson & Abhishek Bhattacharya Department of Statistical Science Box 90251, Duke University Durham, NC 27708-0251,
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationOr How to select variables Using Bayesian LASSO
Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationSmall-variance Asymptotics for Dirichlet Process Mixtures of SVMs
Small-variance Asymptotics for Dirichlet Process Mixtures of SVMs Yining Wang Jun Zhu Tsinghua University July, 2014 Y. Wang and J. Zhu (Tsinghua University) Max-Margin DP-means July, 2014 1 / 25 Outline
More informationLinear Classification
Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative
More informationPractical Bayesian Optimization of Machine Learning. Learning Algorithms
Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that
More informationA Fully Nonparametric Modeling Approach to. BNP Binary Regression
A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation
More informationBayesian Nonparametrics for Speech and Signal Processing
Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer
More informationBayesian Analysis for Natural Language Processing Lecture 2
Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationLecture 5: LDA and Logistic Regression
Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationNonparametric Bayes Density Estimation and Regression with High Dimensional Data
Nonparametric Bayes Density Estimation and Regression with High Dimensional Data Abhishek Bhattacharya, Garritt Page Department of Statistics, Duke University Joint work with Prof. D.Dunson September 2010
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationINTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP
INTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP Personal Healthcare Revolution Electronic health records (CFH) Personal genomics (DeCode, Navigenics, 23andMe) X-prize: first $10k human genome technology
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More information