39th Annual ISMS Marketing Science Conference, University of Southern California, June 8, 2017
1 Permuted and Augmented Stick-Breaking Bayesian Multinomial Regression. Mingyuan Zhou, IROM Department, McCombs School of Business, The University of Texas at Austin. 39th Annual ISMS Marketing Science Conference, University of Southern California, June 8, 2017.
2 Joint work with Quan Zhang, McCombs PhD student in Risk Analysis and Decision Making.
3 Outline: 1. Overview. 2. Stick-breaking multinomial logistic regression; permuted and augmented (paSB) multinomial logistic regression. 3. Binary SVM and multinomial SVM. 4. Binary softplus regression and multinomial softplus regression (MSR). 5. Experiments and discussion.
4 Stick-breaking construction. Provides a size-biased random permutation of the random draw from a Dirichlet process (Sethuraman 1994). Stick success probabilities can depend on covariates (Dunson and Park 2008, Chung and Dunson 2009, Ren et al. 2011). Can be used to model the dependencies between stick probabilities by a logit-Gaussian distribution/process (Linderman et al. 2015).
5 Stick-breaking for the multinomial distribution. Drawing $y_i \sim \sum_{s=1}^S p_{is}\delta_s$ is equivalent to drawing a sequence of binary random variables as
$$b_{is} \mid \{b_{ij}\}_{j<s} \sim \text{Bernoulli}\Big[\Big(1-\sum_{j<s} b_{ij}\Big)\pi_{is}\Big], \qquad \pi_{is} = \frac{p_{is}}{1-\sum_{j<s} p_{ij}}, \quad s = 1, 2, \ldots, S.$$
If we define $y_i = s$ if and only if $b_{is} = 1$ and $b_{ij} = 0$ for all $j < s$, then we have
$$p_{is} = P(y_i = s \mid \{\pi_{ij}\}) = P(b_{is}=1)\prod_{j<s} P(b_{ij}=0) = (\pi_{is})^{1(s<S)} \prod_{j<s}(1-\pi_{ij}).$$
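To make the equivalence concrete, here is a minimal Python sketch (numpy only, with hypothetical stick probabilities) that draws $y_i$ by breaking sticks left to right and checks the empirical frequencies against the closed-form $p_{is}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_probs(pi):
    """Closed-form category probabilities:
    p_s = pi_s^{1(s<S)} * prod_{j<s} (1 - pi_j)  (0-indexed)."""
    S = len(pi)
    p, remaining = np.empty(S), 1.0
    for s in range(S):
        p[s] = remaining * (pi[s] if s < S - 1 else 1.0)
        remaining *= 1.0 - pi[s]
    return p

def draw_y_sequential(pi):
    """Break sticks left to right; the first success determines y,
    and the last category absorbs the case of no earlier success."""
    for s in range(len(pi) - 1):
        if rng.random() < pi[s]:
            return s
    return len(pi) - 1

pi = np.array([0.3, 0.5, 0.2, 0.9])   # hypothetical stick probabilities
emp = np.bincount([draw_y_sequential(pi) for _ in range(100_000)],
                  minlength=len(pi)) / 100_000
print(np.round(stick_probs(pi), 3), np.round(emp, 3))  # should agree closely
```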
6 Theorem 1. Suppose $y_i \sim \sum_{s=1}^S p_{is}\delta_s$, where $[p_{i1}, \ldots, p_{iS}]$ is a probability vector whose elements are constructed as $p_{is} = (\pi_{is})^{1(s<S)}\prod_{j<s}(1-\pi_{ij})$; then $y_i$ can be equivalently generated under augmented stick-breaking (aSB) as
$$y_i \sim \sum_{s=1}^S \big[1(b_{is}=1)\big]^{1(s<S)}\Big[\prod_{j<s} 1(b_{ij}=0)\Big]\delta_s, \qquad b_{is} \sim \text{Bernoulli}(\pi_{is}), \; s \in \{1, \ldots, S\}.$$
7 Augmented stick-breaking transforms the problem of multinomial classification with S categories into the problem of S conditionally independent binary classifications. Gibbs sampling:
Sample $b_{is}$ for $s \in \{1, \ldots, S\}$: $b_{is} = 0$ if $s < y_i$; $b_{is} = 1$ if $s = y_i$; $b_{is} \sim \text{Bernoulli}(\pi_{is})$ if $s > y_i$.
Solve $b_{is} \sim \text{Bernoulli}(\pi_{is})$ for $s \in \{1, \ldots, S\}$, where the covariate-dependent stick probability $\pi_{is}$ for the sth stick/category is a deterministic function of $x_i'\beta_s$.
Any binary classification model (with cross-entropy loss) can be generalized to a multinomial one under augmented stick-breaking, but a naive combination may not work well.
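As a sketch of the first Gibbs step above (assuming 0-indexed categories and a generic vector of stick probabilities), the conditional draw of the binary stick variables given $y_i$ is:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_b_given_y(y, pi):
    """Conditional posterior of b given y (0-indexed): zeros before
    stick y, a one at stick y, fresh Bernoulli(pi_s) draws after it."""
    S = len(pi)
    b = np.zeros(S, dtype=int)
    b[y] = 1
    b[y + 1:] = rng.random(S - y - 1) < pi[y + 1:]
    return b
```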
8 Example: augmented stick-breaking logistic regression. If we let $\pi_{is} = \frac{e^{x_i'\beta_s}}{1+e^{x_i'\beta_s}}$, which means $\text{logit}(\pi_{is}) = x_i'\beta_s$, then we have
$$p_{is} = \frac{e^{x_i'\beta_s}}{\prod_{j \le s}\big(1+e^{x_i'\beta_j}\big)}.$$
Stick-breaking multinomial logistic regression:
$$y_i \sim \sum_{s=1}^S \big[1(b_{is}=1)\big]^{1(s<S)}\Big[\prod_{j<s} 1(b_{ij}=0)\Big]\delta_s, \qquad b_{is} \sim \text{Bernoulli}\Big(\pi_{is} = \frac{e^{x_i'\beta_s}}{1+e^{x_i'\beta_s}}\Big), \; s \in \{1, \ldots, S\}.$$
9 Gibbs sampling: Sample $b_{is}$ for $s \in \{1, \ldots, S\}$: $b_{is} = 0$ if $s < y_i$; $b_{is} = 1$ if $s = y_i$; $b_{is} \sim \text{Bernoulli}(\pi_{is})$ if $s > y_i$. Solve $b_{is} \sim \text{Bernoulli}\big(\pi_{is} = e^{x_i'\beta_s}/(1+e^{x_i'\beta_s})\big)$: $\beta_s$ can be inferred with the Polya-Gamma data augmentation, yielding closed-form Gibbs sampling update equations. Problem solved? End of the talk?? Not really...
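A minimal sketch of the $\beta_s$ update, assuming the third-party `polyagamma` package (its `random_polyagamma` sampler) and a zero-mean Gaussian prior with precision `prior_prec`; this is one way to realize the Polya-Gamma step, not necessarily the authors' exact implementation:

```python
import numpy as np
from polyagamma import random_polyagamma  # assumed PG(1, c) sampler (PyPI: polyagamma)

rng = np.random.default_rng(2)

def update_beta_s(X, b_s, beta_s, prior_prec=1.0):
    """One Polya-Gamma Gibbs update for a single stick's coefficients:
    omega_i ~ PG(1, x_i' beta_s), then beta_s | omega is Gaussian."""
    omega = random_polyagamma(z=X @ beta_s, random_state=rng)
    kappa = b_s - 0.5                          # b_s in {0, 1}
    V = np.linalg.inv(X.T @ (omega[:, None] * X)
                      + prior_prec * np.eye(X.shape[1]))
    m = V @ (X.T @ kappa)                      # zero prior mean assumed
    return rng.multivariate_normal(m, V)
```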
10 The number of geometric constraints increases in s.
$p_{i1} = \big(1+e^{-x_i'\beta_1}\big)^{-1}$ is larger than 0.5 if $x_i'\beta_1 > 0$.
$p_{i2} = \big(1+e^{x_i'\beta_1}\big)^{-1}\big(1+e^{-x_i'\beta_2}\big)^{-1}$ can be larger than 0.5 only if both $x_i'\beta_1 < 0$ and $x_i'\beta_2 > 0$.
$p_{i3} = \big(1+e^{x_i'\beta_1}\big)^{-1}\big(1+e^{x_i'\beta_2}\big)^{-1}\big(1+e^{-x_i'\beta_3}\big)^{-1}$ can be larger than 0.5 only if $x_i'\beta_1 < 0$, $x_i'\beta_2 < 0$, and $x_i'\beta_3 > 0$.
11 Stick-breaking multinomial logistic regression is not invariant to label permutation, and may or may not work depending on how the categories are labeled (ordered). For the Iris data (sepal and petal lengths as covariates), it does not work well if the 1st category/stick is the blue points (middle), the 2nd is the black points (bottom), and the 3rd is the gray points (top). It works well as long as the blue points are not labeled as the first category (stick).
12 If some categories are not linearly separable, augmented stick-breaking multinomial logistic regression may not work well no matter how the categories are labeled. How to address the sensitivity to label permutation? How to separate the categories that are not linearly separable?
13 Permuted and augmented (paSB) stick-breaking. Denote $z = (z_1, \ldots, z_S)$ as a permutation of $(1, \ldots, S)$: $z_s \in \{1, \ldots, S\}$ is the index of the latent stick that category s is uniquely mapped to. There are S! possible permutations: 6! = 720; 7! = 5,040; ...; 10! = 3,628,800; ... Fortunately, the effective search space can be significantly smaller than S!. We will discuss why and provide examples.
14 Theorem 2. Suppose $y_i \sim \sum_{s=1}^S p_{is}(z)\delta_s$, where $[p_{i1}(z), \ldots, p_{iS}(z)]$ is a probability vector whose elements are constructed as
$$p_{is}(z) = (\pi_{i z_s})^{1(z_s<S)} \prod_{j<z_s}(1-\pi_{ij});$$
then $y_i$ can be equivalently generated under the permuted and augmented stick-breaking (paSB) construction as
$$y_i \sim \sum_{s=1}^S \big[1(b_{i z_s}=1)\big]^{1(z_s<S)}\Big[\prod_{j<z_s} 1(b_{ij}=0)\Big]\delta_s, \qquad b_{ij} \sim \text{Bernoulli}(\pi_{ij}), \; j \in \{1, \ldots, S\}.$$
S categories are randomly one-to-one mapped to S sticks. $\{b_{is}\}_s$ given $\{\pi_{is}\}_s$ are mutually independent in the prior.
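A minimal numpy sketch of the probability map in Theorem 2 (0-indexed sticks and categories, hypothetical inputs): compute the probability that each stick "wins" the stick-breaking race, then read off categories through the permutation z:

```python
import numpy as np

def category_probs(pi, z):
    """p_s(z) = pi_{z_s}^{1(z_s < S)} * prod_{j < z_s} (1 - pi_j),
    with pi indexed by stick and z mapping category -> stick (0-indexed)."""
    S = len(pi)
    cum = np.cumprod(np.append(1.0, 1.0 - pi[:-1]))  # prod_{j<m} (1 - pi_j)
    head = np.where(np.arange(S) < S - 1, pi, 1.0)   # pi_m, or 1 for the last stick
    p_stick = head * cum                             # probability that stick m wins
    return p_stick[np.asarray(z)]

pi = np.array([0.4, 0.6, 0.5])        # hypothetical stick probabilities
print(category_probs(pi, [2, 0, 1]))  # category 0 -> stick 2, etc.; sums to 1
```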
15 Gibbs sampling for paSB: Sample $b_{is}$ for $s \in \{1, \ldots, S\}$, where category $y_i$ is mapped to stick $z_{y_i}$: $b_{is} = 0$ if $s < z_{y_i}$; $b_{is} = 1$ if $s = z_{y_i}$; $b_{is} \sim \text{Bernoulli}(\pi_{is})$ if $s > z_{y_i}$. Solve $b_{is} \sim \text{Bernoulli}(\pi_{is})$ for $s \in \{1, \ldots, S\}$, where the covariate-dependent stick probability $\pi_{is}$ for the sth stick is a deterministic function of $x_i'\beta_s$. Sample $(z_1, \ldots, z_S)$, the one-to-one mapping between the category and stick indices, using Metropolis-Hastings.
16 Sample the label-stick one-to-one mapping: Let $(z_1, \ldots, z_S)$ be uniformly at random selected from the S! possible permutations in the prior. Propose to change $z = (z_1, \ldots, z_j, \ldots, z_{j'}, \ldots, z_S)$ to $z' := (z_1, \ldots, z_{j'}, \ldots, z_j, \ldots, z_S)$, i.e., swap two entries. Accept the proposal with probability
$$\min\left\{ \prod_i \frac{\prod_{s=1}^S [p_{is}(z')]^{1(y_i=s)}}{\prod_{s=1}^S [p_{is}(z)]^{1(y_i=s)}},\; 1 \right\} = \min\left\{ \prod_i \frac{\prod_{s=1}^S \big[(\pi_{i z'_s})^{1(z'_s<S)}\prod_{j<z'_s}(1-\pi_{ij})\big]^{1(y_i=s)}}{\prod_{s=1}^S \big[(\pi_{i z_s})^{1(z_s<S)}\prod_{j<z_s}(1-\pi_{ij})\big]^{1(y_i=s)}},\; 1 \right\}.$$
Proposing two indices $z_j$ and $z_{j'}$ to switch in each iteration is effective for escaping from the set of poor mappings. The probability that a given $z_j$ has not been proposed to switch is $[(S-2)/S]^t$ after t MCMC iterations; even if S = 100, this probability falls below $10^{-8}$ after roughly 1,000 iterations. S/2 is the expected number of iterations for a $z_j$ to be proposed to switch.
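A minimal sketch of this Metropolis-Hastings swap (uniform prior over permutations and a symmetric proposal, so only the likelihood ratio matters; `Pi[i, j]` holds the stick probabilities $\pi_{ij}$, everything 0-indexed):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_lik(z, y, Pi):
    """Log-likelihood of labels y under category-to-stick mapping z."""
    n, S = Pi.shape
    ll = 0.0
    for i in range(n):
        zs = z[y[i]]                          # stick of observation i's category
        ll += np.sum(np.log1p(-Pi[i, :zs]))   # earlier sticks must fail
        if zs < S - 1:                        # the last stick wins by default
            ll += np.log(Pi[i, zs])
    return ll

def mh_swap(z, y, Pi):
    """Propose swapping two randomly chosen entries of z; accept with
    probability min(likelihood ratio, 1)."""
    j, k = rng.choice(len(z), size=2, replace=False)
    z_new = z.copy()
    z_new[j], z_new[k] = z_new[k], z_new[j]
    if np.log(rng.random()) < log_lik(z_new, y, Pi) - log_lik(z, y, Pi):
        return z_new
    return z
```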
17 paSB multinomial logistic regression model:
$$y_i \sim \sum_{s=1}^S \big[1(b_{i z_s}=1)\big]^{1(z_s<S)}\Big[\prod_{j<z_s} 1(b_{ij}=0)\Big]\delta_s, \qquad b_{is} \sim \text{Bernoulli}\Big(\pi_{is} = \frac{e^{x_i'\beta_s}}{1+e^{x_i'\beta_s}}\Big), \; s \in \{1, \ldots, S\}.$$
The number of geometric constraints increases in $z_s$. If $z_s = 1$, then $p_{is} = \big(1+e^{-x_i'\beta_1}\big)^{-1}$ is larger than 0.5 if $x_i'\beta_1 > 0$. If $z_s = 2$, then $p_{is} = \big(1+e^{x_i'\beta_1}\big)^{-1}\big(1+e^{-x_i'\beta_2}\big)^{-1}$ can be larger than 0.5 only if both $x_i'\beta_1 < 0$ and $x_i'\beta_2 > 0$. If $z_s = 3$, then $p_{is} = \big(1+e^{x_i'\beta_1}\big)^{-1}\big(1+e^{x_i'\beta_2}\big)^{-1}\big(1+e^{-x_i'\beta_3}\big)^{-1}$ can be larger than 0.5 only if $x_i'\beta_1 < 0$, $x_i'\beta_2 < 0$, and $x_i'\beta_3 > 0$.
18 Sequential decision making and relaxing the independence of irrelevant alternatives (IIA) assumption: a one-vs-remaining decision is made at each of the stick-breaking steps. Lemma: Under the paSB construction, the probability ratio of two choices is influenced by the success probabilities of the sticks that lie between these two choices' corresponding sticks. In other words, the probability ratio of two choices will be influenced by some other choices if they are not mapped to adjacent sticks.
19 Stick-breaking multinomial logistic regression as a discrete choice model. The stick-breaking multinomial logistic regression that assigns choice $s \in \{1, \ldots, S\}$ to individual i with probability $p_{is} = (\pi_{is})^{1(s<S)}\prod_{j<s}(1-\pi_{ij})$, $\pi_{is} = 1/(1+e^{-W_{is}})$, is equivalent to a sequential random utility maximization model which selects choice s once $U_{is} > \sum_{j>s} U_{ij}$ is observed, where
$$U_{i1} = U_{i2} + \cdots + U_{iS} + W_{i1} + \varepsilon_{i1}, \;\ldots,\; U_{is} = \sum_{j>s} U_{ij} + W_{is} + \varepsilon_{is}, \;\ldots,\; U_{i(S-1)} = W_{i(S-1)} + \varepsilon_{i(S-1)}, \; U_{iS} = 0,$$
and $\varepsilon_{is} \overset{iid}{\sim} \text{Logistic}(0, 1)$.
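A quick simulation check of this equivalence, with hypothetical utilities $W_{i1}, \ldots, W_{i(S-1)}$: since $U_{is} - \sum_{j>s} U_{ij} = W_{is} + \varepsilon_{is}$, the sequential rule reduces to the stick-breaking draws above:

```python
import numpy as np

rng = np.random.default_rng(4)

def choice_by_utility(W):
    """Pick the first s with U_s > sum_{j>s} U_j, which simplifies to
    the first s with W_s + eps_s > 0; category S wins if none does."""
    eps = rng.logistic(size=len(W))
    for s in range(len(W)):
        if W[s] + eps[s] > 0:
            return s
    return len(W)                            # U_S = 0 catches the remainder

W = np.array([0.2, -1.0, 0.7])               # hypothetical W_{i1}..W_{i,S-1}
pi = 1 / (1 + np.exp(-W))
p = np.append(pi, 1.0) * np.cumprod(np.append(1.0, 1 - pi))
emp = np.bincount([choice_by_utility(W) for _ in range(100_000)],
                  minlength=len(W) + 1) / 100_000
print(np.round(p, 3), np.round(emp, 3))      # stick-breaking vs. utility draws
```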
20 Binary support vector machine (SVM):
$$\ell(\beta, \nu) = \sum_{i=1}^n \max(1 - y_i x_i'\beta,\, 0) + \nu R(\beta), \qquad y_i \in \{-1, 1\}.$$
21 Bayesian binary SVM [Polson & Scott (2011)]. Mixture representation:
$$L(y_i \mid x_i, \beta) = \exp\{-2\max(1 - y_i x_i'\beta,\, 0)\} = \int_0^\infty \frac{1}{\sqrt{2\pi\lambda_i}} \exp\Big(-\frac{(1 + \lambda_i - y_i x_i'\beta)^2}{2\lambda_i}\Big)\, d\lambda_i.$$
Gibbs sampling for $\beta$ is available under data augmentation. Decision rule [Sollich (2002) and Mallick et al. (2005)]:
$$P(y_i \mid x_i, \beta) = \begin{cases} \big(1 + e^{-2 y_i x_i'\beta}\big)^{-1}, & \text{for } |x_i'\beta| \le 1; \\ \big(1 + e^{-y_i [x_i'\beta + \text{sign}(x_i'\beta)]}\big)^{-1}, & \text{for } |x_i'\beta| > 1. \end{cases}$$
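A small helper implementing this decision rule as a stick success probability (a sketch; `f` stands for the SVM score $x_i'\beta$):

```python
import numpy as np

def pi_svm(f):
    """P(y = 1 | f) implied by the hinge pseudo-likelihood: logistic in
    2f inside the margin, logistic in f + sign(f) outside it."""
    f = np.asarray(f, dtype=float)
    g = np.where(np.abs(f) <= 1.0, 2.0 * f, f + np.sign(f))
    return 1.0 / (1.0 + np.exp(-g))
```

Plugging `1 - pi_svm(...)` into the paSB product on the next slide gives the multinomial SVM probabilities.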
22 Under the paSB construction, given the covariate vector $x_i$ and category-stick mapping z, the multinomial support vector machine (SVM) parameterizes $p_{is}$, the probability of category s, as
$$p_{is}(z) = [\pi_{i z_s, \text{SVM}}(x_i, \beta_s)]^{1(z_s<S)} \prod_{j:\, z_j < z_s} \big[1 - \pi_{i z_j, \text{SVM}}(x_i, \beta_j)\big],$$
where
$$1 - \pi_{i z_j, \text{SVM}}(x_i, \beta_j) = \begin{cases} \big(1 + e^{2 x_i'\beta_j}\big)^{-1}, & \text{for } |x_i'\beta_j| \le 1; \\ \big(1 + e^{x_i'\beta_j + \text{sign}(x_i'\beta_j)}\big)^{-1}, & \text{for } |x_i'\beta_j| > 1. \end{cases}$$
23 Binary softplus regression [Zhou (2016)]:
$$b_{is} \sim \text{Bernoulli}\left(1 - \prod_{k=1}^K \Big[1 + e^{x_i'\beta_{sk}^{(T+1)}} \ln\Big\{1 + e^{x_i'\beta_{sk}^{(T)}} \ln\Big(\cdots \ln\big[1 + e^{x_i'\beta_{sk}^{(2)}}\big] \cdots\Big)\Big\}\Big]^{-r_{sk}}\right).$$
Equivalently, in augmented form:
$$\theta_{isk}^{(T)} \sim \text{Gamma}\big(r_{sk},\, e^{x_i'\beta_{sk}^{(T+1)}}\big), \quad \theta_{isk}^{(t)} \sim \text{Gamma}\big(\theta_{isk}^{(t+1)},\, e^{x_i'\beta_{sk}^{(t+1)}}\big), \;\ldots,\; \theta_{isk}^{(1)} \sim \text{Gamma}\big(\theta_{isk}^{(2)},\, e^{x_i'\beta_{sk}^{(2)}}\big),$$
$$b_{is} = 1(m_{is} \ge 1), \qquad m_{is} = \sum_{k=1}^K m_{isk}^{(1)}, \qquad m_{isk}^{(1)} \sim \text{Pois}\big(\theta_{isk}^{(1)}\big).$$
K is supported by the gamma process.
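A numpy sketch of the marginal Bernoulli probability with a finite truncation K (hypothetical shapes: `Beta[k, t]` stores $\beta_k^{(t+2)}$ for t = 0..T-1, and `r[k] > 0`); the nested logs are evaluated from the innermost $\beta^{(2)}$ outward:

```python
import numpy as np

def p_sum_stack_softplus(X, Beta, r):
    """P(b_i = 1) = 1 - prod_k [1 + e^{x'b^{(T+1)}} ln(1 + e^{x'b^{(T)}}
    ln(... ln(1 + e^{x'b^{(2)}}) ...))]^{-r_k}, truncated at K experts."""
    K, T, _ = Beta.shape
    log_p0 = np.zeros(X.shape[0])              # accumulates log P(m_i = 0)
    for k in range(K):
        u = np.ones(X.shape[0])
        for t in range(T - 1):                 # inner layers beta^{(2)}..beta^{(T)}
            u = np.log1p(np.exp(X @ Beta[k, t]) * u)
        log_p0 -= r[k] * np.log1p(np.exp(X @ Beta[k, T - 1]) * u)
    return 1.0 - np.exp(log_p0)
```

For T = 1 this reduces to the sum-softplus form $1 - \prod_k \big(1 + e^{x_i'\beta_k^{(2)}}\big)^{-r_k}$.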
24 Properties of binary softplus regression. K > 1 and T = 1: using the intersection of up to K hyperplanes to enclose negative examples. K = 1 and T > 1: using the intersection of up to T hyperplanes to enclose positive examples. K > 1 and T > 1: using the union of convex-polytope-like confined spaces to enclose positive examples. K and T together control the nonlinear capacity of the model; K is supported by the gamma process.
25 Binary softplus regression with K = 20, T = 1; label Class A as 1 and Class B as 0. [Figure: decision boundaries]
26 Binary softplus regression with K = 1, T = 20; label Class A as 0 and Class B as 1. [Figure: decision boundaries]
27 Binary softplus regression with K = 20, T = 20; label Class A as 1 and Class B as 0. [Figure: decision boundaries]
28 Multinomial softplus regression (MSR). With a draw from a gamma process for each category that consists of countably infinite atoms $\beta_{sk}^{(2:T+1)}$ with weights $r_{sk} > 0$, where $\beta_{sk}^{(t)} \in \mathbb{R}^{P+1}$, given the covariate vector $x_i$ and category-stick mapping z, MSR parameterizes $p_{is}$, the probability of category s, under the paSB construction as
$$p_{is}(z) = \left[1 - \prod_{k=1}^\infty \Big(1 + e^{x_i'\beta_{sk}^{(T+1)}} \ln\Big\{1 + e^{x_i'\beta_{sk}^{(T)}} \ln\Big(\cdots \ln\big[1 + e^{x_i'\beta_{sk}^{(2)}}\big] \cdots\Big)\Big\}\Big)^{-r_{sk}}\right]^{1(z_s<S)} \prod_{j:\, z_j < z_s} \prod_{k=1}^\infty \Big(1 + e^{x_i'\beta_{jk}^{(T+1)}} \ln\Big\{1 + e^{x_i'\beta_{jk}^{(T)}} \ln\Big(\cdots \ln\big[1 + e^{x_i'\beta_{jk}^{(2)}}\big] \cdots\Big)\Big\}\Big)^{-r_{jk}}.$$
29 Permuted and augmented stick-breaking MSR. Label blue, black, and gray (middle, bottom, and top) points as classes 1, 2, and 3, respectively; fix z = [1, 2, 3]. [Figure rows: Row 1: K = 1, T = 1. Row 2: K = 1, T = 3. Row 3: K = 5, T = 1. Row 4: K = 5, T = 3.]
30 Label blue, black, and gray (inside left, outside, and inside right) points as classes 1, 2, and 3, respectively; MSR with K = T = 10. [Figure rows: Row 1: z = [1, 2, 3]. Row 2: z = [2, 1, 3]. Row 3: z = [3, 1, 2]. Row 4: z sampled during MCMC iterations.]
31 Model comparison. Table: comparison of classification error rates of paSB-SVM, MSR with (K, T) in {(1, 1), (1, 3), (5, 1), (5, 3)}, L2-MLR, and AMM on the square, iris, wine, glass, vehicle, waveform, segment, vowel, dna, and satimage datasets. [The numeric error rates were lost in the transcription.]
32 Number of active experts. Figure: boxplots of the number of active experts on square, iris, wine, glass, vehicle, waveform, segment, vowel, dna, and satimage.
33 Likelihoods for the S! different permutations. Figure: histogram of the S! = 6! = 720 log-likelihoods for Satimage, using permuted and augmented stick-breaking MSR with K = 5 and T = 3. The blue line indicates the average log-likelihood of the collected MCMC samples of MSR, with the permutation z sampled via the proposed Metropolis-Hastings step.
34 Softplus regression with support hyperplanes. Figure: Row 1: softplus regression with K = 5, T = 3. Row 2: softplus regression with K = 5, T = 3 and data transformation with support hyperplanes.
35 Discussion. A general framework to transform a binary classifier into a multi-class one. Fully Bayesian inference via data augmentation. The coefficient vectors of different categories can be sampled in parallel in each MCMC iteration. Not invariant to label permutation if the label-stick mapping is fixed. Asymmetric geometric constraints (more constraints for a category mapped to a larger-indexed stick).
36 Thank You!
Published version: Quan Zhang (quan.zhang@mccombs.utexas.edu). Permuted and Augmented Stick-Breaking Bayesian Multinomial Regression. Journal of Machine Learning Research 18 (2018) 1-33. Submitted 7/17; Revised 12/17; Published 4/18.