Appearing in the NIPS 2005 workshop "Foundations of Active Learning", Whistler, Canada, December 2005.

Transductive Experiment Design

Kai Yu, Jinbo Bi, Volker Tresp
Siemens AG, Munich, Germany

Abstract

This paper considers the problem of selecting the most informative experiments $x$ at which to take measurements $y$ for learning an inference model $y = f(x)$. We propose a novel concept for active learning, transductive experiment design, to overcome the shortcomings of existing experiment design methods, e.g. insufficient exploration of the available unmeasured data and poor scalability to large data sets. An in-depth analysis shows that the method favors experiments that are hard to predict and at the same time representative of the remaining hard-to-predict data. Efficient solutions are developed through mathematical programming techniques. Encouraging results on toy problems and real-world data sets highlight the advantages of the proposed approaches.

1 Experiment Design

The problem of active learning is often referred to as experiment design in statistics (see [1, 2]). Formally, in order to learn a function $f(x) = w^\top x$, $w \in \mathbb{R}^d$, one takes measurements, or experiments, $y_i = w^\top x_i + \epsilon_i$, $i = 1, \dots, m$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ and hence $y_i \sim \mathcal{N}(w^\top x_i, \sigma^2)$. The experiments $x_1, \dots, x_m$ are chosen among $n$ possible test data $v_1, \dots, v_n \in \mathbb{R}^d$, $n > m$. The goal of experiment design is to choose the $m$ vectors $x_i$ from among the possible choices so that the estimation error is small; in other words, the task is to find a set of data $x_i$ that together are maximally informative. The maximum-likelihood estimate of $w$ is
$$\hat{w} = \arg\min_w \sum_{i=1}^m (w^\top x_i - y_i)^2.$$
The estimation error $\hat{w} - w$ has zero mean and covariance matrix $C_w = \sigma^2 (X^\top X)^{-1}$ [2].^1 The matrix $C_w$ characterizes the accuracy of the estimation, or the informativeness of the experiments. Let $m_j$ denote the number of experiments for which $v_j$ is chosen in $X$, where $m_1 + \dots + m_n = m$. The so-called A-optimal design minimizes the trace of $C_w$, namely
$$\text{minimize} \quad \mathrm{Tr}\Big[\Big(\sum_{j=1}^n m_j v_j v_j^\top\Big)^{-1}\Big] \quad \text{subject to} \quad m_j \ge 0, \; m_1 + \dots + m_n = m, \; m_j \in \mathbb{Z},$$
where $\sigma^2$ is dropped from the objective since it is a constant. The integer constraints on the $m_j$ can be relaxed.

^1 In the rest of this paper we use $X$ to represent both the matrix $[x_1, \dots, x_m] \in \mathbb{R}^{m \times d}$ and the index set $\{x_i\}$, and $V$ to represent both $[v_1, \dots, v_n] \in \mathbb{R}^{n \times d}$ and the index set $\{v_i\}$; the meaning will be clear from the context.
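As a concrete illustration (not part of the original paper), the following minimal NumPy sketch evaluates the classical A-optimality criterion $\mathrm{Tr}[C_w]$ for a concrete choice of experiments $X$, rather than the relaxed weights $m_j$; the pool size, dimensionality, and noise level are arbitrary illustrative choices.

```python
import numpy as np

def a_optimality(X, sigma2=1.0):
    """Classical A-optimality criterion Tr[C_w] = sigma^2 * Tr[(X^T X)^{-1}].

    X must contain at least d linearly independent experiments, otherwise
    X^T X is singular -- one of the shortcomings discussed in Section 2.
    """
    return sigma2 * np.trace(np.linalg.inv(X.T @ X))

# Compare two candidate designs drawn from a pool of test points V.
rng = np.random.default_rng(0)
V = rng.normal(size=(100, 4))            # n = 100 candidates in R^4
print(a_optimality(V[:6]), a_optimality(V[-6:]))
```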

There are also other variants of experiment design, such as D-optimal and E-optimal design. All of them are semidefinite programming (SDP) problems and can be interpreted as finding a minimum ellipsoid that encloses the data $V$; they differ only in how the size of the ellipsoid is measured.

2 Transductive Experiment Design

Classical experiment design^2 has several shortcomings. First, optimization criteria based on $C_w$ are only indirect indicators of the quality of the learned function: since the function will be used to make predictions on future test data, it is more desirable to assess the prediction quality on test data directly. Second, minimizing the variance of $w$ amounts to minimizing the variance of $w^\top x$ over the entire input space; if one is only interested in predicting well on non-uniformly distributed data, this is unnecessary or even harmful. Third, the number of experiments is implicitly required to be no less than the dimensionality of the inputs $x_i$, because when $m < d$ the matrix $X^\top X$ is not invertible; this becomes serious when the input dimensionality is in the thousands but the budget affords only a few experiments. Finally, classical experiment design has to solve a semidefinite program, which is often very slow already for hundreds of data points. To overcome these shortcomings, we perform experiment design in a transductive setting, where the focus is the predictive performance on test data given beforehand.

^2 In the rest of this paper we refer to existing experiment design methods as classical experiment design.

2.1 General Transductive Experiment Design

A general setting may consider a set $T$ of test points different from the experiment candidates $V$. For simplicity, and without loss of generality, we assume the two sets are the same. Let $w$ a priori follow a Gaussian distribution $\mathcal{N}(0, \nu^2 I)$, where $I$ is the $d \times d$ identity matrix. Based on the training examples $\{x_i, y_i\}$ observed from experiments $y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, the function weights $w$ are estimated by
$$\min_w \; \sum_{i=1}^m (w^\top x_i - y_i)^2 + \mu \|w\|^2,$$
where $\mu = \sigma^2 / \nu^2 > 0$ and $\|\cdot\|$ is the vector 2-norm. Following a similar procedure as before, the covariance of the estimation error of $w$ is $C_w = \sigma^2 (X^\top X + \mu I)^{-1}$, where $X^\top X + \mu I$ is always full rank. Let $f = [f(v_1), \dots, f(v_n)]^\top$ be the function values on all the available data $V$; the prediction error then has covariance matrix
$$C_f = E\big[(f - \hat{f})(f - \hat{f})^\top\big] = V C_w V^\top.$$
In contrast to $C_w$ in the classical design, $C_f$ directly characterizes the quality of the predictions on the target data. Applying the Woodbury inversion identity, minimizing the trace of $C_f$ can be formulated as
$$\text{maximize} \quad \mathrm{Tr}\big[V X^\top (X X^\top + \mu I)^{-1} X V^\top\big] \qquad (1)$$
$$\text{subject to} \quad X \subset V, \quad |X| = m,$$
where $|X| = m$ restricts the number of chosen experiments to $m$. The matrix $V X^\top (X X^\top + \mu I)^{-1} X V^\top$ plays a key role in the following discussion.
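To make objective (1) concrete, here is a small brute-force sketch of ours (not one of the algorithms proposed in this paper) that evaluates $\mathrm{Tr}[V X^\top (X X^\top + \mu I)^{-1} X V^\top]$ for every size-$m$ subset of a tiny candidate pool; the scalable LP and greedy solutions are the subject of Section 3.

```python
import numpy as np
from itertools import combinations

def transductive_objective(V, idx, mu=1.0):
    """Objective (1): Tr[V X^T (X X^T + mu I)^{-1} X V^T] with X = V[idx]."""
    X = V[list(idx)]
    M = np.linalg.inv(X @ X.T + mu * np.eye(len(idx)))
    return np.trace(V @ X.T @ M @ X @ V.T)

def exhaustive_design(V, m, mu=1.0):
    """Exact maximizer of (1) by enumerating all size-m subsets.
    Only feasible for tiny pools; a didactic baseline, not a
    practical algorithm."""
    n = V.shape[0]
    return max(combinations(range(n), m),
               key=lambda idx: transductive_objective(V, idx, mu))

rng = np.random.default_rng(0)
V = rng.normal(size=(20, 3))             # 20 candidates in R^3
print(exhaustive_design(V, m=3))
```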

Theorem 2.1 Let $\chi(\cdot)$ denote the projection onto the subspace of $\mathbb{R}^d$ that is the orthogonal complement of $\mathrm{span}(x_1, \dots, x_m)$. The objective function in (1) is equivalent to
$$\sum_i \|\phi(v_i)\|^2 = \sum_i \big( \|v_i\|^2 - \|\chi(v_i)\|^2 - \|\psi(v_i)\|^2 \big), \qquad (2)$$
with $\psi(v) = \sum_{j=1}^m \frac{\mu}{\lambda_j + \mu}\, h_j h_j^\top v$ and $\phi(v) = \sum_{j=1}^m \frac{\lambda_j}{\lambda_j + \mu}\, h_j h_j^\top v$, where $h_1, \dots, h_m$ and $\lambda_1 \ge \dots \ge \lambda_m \ge 0$ are respectively the eigenvectors and eigenvalues of $X^\top X$.

Due to space limitations we omit all proofs; details are given in a longer version [5]. We offer a few comments on transductive experiment design:

- Since $\|\phi(v_i)\|^2$ is upper bounded by $\|v_i\|^2$, candidates $v_i$ with a larger norm $\|v_i\|$ have a higher potential to produce a large $\|\phi(v_i)\|^2$, so the selected experiments $X$ forming $\phi(\cdot)$ should include such $v_i$. Because $E[f(v_i)^2] = v_i^\top E[w w^\top] v_i = \nu^2 \|v_i\|^2$, the norm $\|v_i\|^2$ encodes the prior uncertainty of the function at $v_i$: transductive experiment design tends to select experiments with uncertain outcomes.

- On the other hand, maximizing $\mathrm{Tr}(Z Z^\top)$ indicates that $\sum_{i=1}^n \|\chi(v_i)\|^2$ should be small, i.e. the projections of $V$ onto the orthogonal complement subspace should be as small as possible. Essentially, the optimization seeks the set $X$ of experiments that retains as much of the information of $V$ as possible in $\mathrm{span}(x_1, \dots, x_m)$: transductive experiment design tends to select experiments that are representative of $V$.

- Due to the regularization, minimizing $\|\psi(v_i)\|^2$ implies that $V$ should be strongly correlated with the leading eigenvectors of $X^\top X$: transductive experiment design tends to select experiments $X$ whose significant patterns capture the information of $V$.

Transductive experiment design combines these three criteria in a unified framework. In some sense, classical experiment design considers only the first criterion, since it picks experiments that lie on the surface of the minimum-volume ellipsoid and are thus far from the origin (i.e. have a large norm). The key contributor to the second criterion is the idea of transduction, namely focusing only on predictions for the target cases $V$. The third criterion can be seen as a refinement of the second one caused by the effect of regularization. Following the terminology of classical experiment design, we call problem (1) the A-optimal transductive design.^3 We are now ready to handle experiment design with nonlinear functions by introducing the kernelized version (see details in [5]).

^3 One can also consider other variants of the objective function, such as minimizing the 2-norm of $C_f$ (see [5]). In this paper we mainly focus on the A-optimal transductive design.
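The kernelized version is only referenced here, with details deferred to [5]. As a hedged sketch of what it looks like (our assumption, based on the standard kernel substitution and the kernel quantities $K = V V^\top$ and $K_{vx} = V X^\top$ that reappear in Section 3), replacing the inner products in (1) by kernel evaluations gives the objective computed below; the RBF kernel and its width are our illustrative choices, not the paper's.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_transductive_objective(V, idx, mu=1.0, gamma=1.0):
    """Kernelized analogue of objective (1):
    Tr[K_vx (K_xx + mu I)^{-1} K_vx^T], a sketch of the nonlinear
    extension whose details are given in [5]."""
    X = V[list(idx)]
    K_vx = rbf_kernel(V, X, gamma)
    K_xx = rbf_kernel(X, X, gamma)
    M = np.linalg.inv(K_xx + mu * np.eye(len(idx)))
    return np.trace(K_vx @ M @ K_vx.T)

rng = np.random.default_rng(0)
V = rng.normal(size=(30, 2))
print(kernel_transductive_objective(V, idx=[0, 5, 9], mu=0.5, gamma=0.5))
```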

3 A-optimal Transductive Experiment Design

Various design strategies can be employed to conduct transductive experiment design. We give examples of how we establish A-optimal design solutions; please consult [5] for a more complete discussion. A-optimal design requires solving a difficult combinatorial optimization problem when $m > 1$. Fortunately, it has an equivalent formulation:
$$\text{minimize} \quad \sum_{i=1}^n \big( \pi_i \|q_i - K_{vx} c_i\|^2 + \mu \pi_i \|c_i\|^2 \big) \qquad (3)$$
$$\text{subject to} \quad X \subset V, \quad |X| = m, \quad C = [c_1, \dots, c_n] \in \mathbb{R}^{n \times m},$$
where $Q = [q_1, \dots, q_n]$ and $\pi_1, \dots, \pi_n$ are the eigenvectors and eigenvalues of $K = V V^\top$, and $K_{vx} = V X^\top$. Clearly, A-optimal design seeks a subset of $m$ experiments $X$ from which the best approximation of the leading eigenvectors of $K$ can be constructed. Instead of minimizing the quadratic loss as in Eq. (3), many recent works [3, 6] have shown that the absolute-deviation loss together with 1-norm regularization is equally suitable for learning inference models, and the resulting linear programs can be solved efficiently. We therefore design a novel algorithm that approximates the leading eigenvectors of $K$ in terms of the absolute-deviation loss and 1-norm regularization, which enhances scalability:
$$\min_{\beta \ge 0} \; \min_{\alpha_i} \; \sum_{i=1}^n \big( \pi_i \|q_i - K B \alpha_i\|_1 + \mu \pi_i \|B \alpha_i\|_1 \big) + \gamma \|\beta\|_1, \qquad (4)$$
where $B$ is an $n \times n$ diagonal matrix whose $j$-th diagonal element $\beta_j \in \{0, 1\}$ indicates whether the corresponding experiment appears in $X$. An alternating optimization procedure yields an iterative algorithm with two major steps per iteration. The first step fixes $B$, converts $K \leftarrow K B$, and solves the following problem for the optimal $\alpha_i$:
$$\min_{\alpha_i, \xi_i, s_i} \; \sum_{i=1}^n \big( e^\top \xi_i + \mu \pi_i \beta^\top s_i \big) \quad \text{s.t.} \quad \pi_i q_i - K \alpha_i \le \xi_i, \;\; K \alpha_i - \pi_i q_i \le \xi_i, \;\; -s_i \le \alpha_i \le s_i, \;\; i = 1, \dots, n. \qquad (5)$$
The second step fixes the $\alpha_i$ at the above solution, converts $K_i \leftarrow K\, \mathrm{diag}(\alpha_i)$, and solves the following problem for the optimal $\hat{\beta}$:
$$\min_{\beta, \xi_i} \; \sum_{i=1}^n e^\top \xi_i + \gamma e^\top \beta \quad \text{s.t.} \quad \pi_i q_i - K_i \beta \le \xi_i, \;\; K_i \beta - \pi_i q_i \le \xi_i, \;\; i = 1, \dots, n, \;\; \beta \ge 0. \qquad (6)$$
Note that both problems (5) and (6) are linear programs (LP) and can be solved efficiently; we therefore call this the LP A-optimal algorithm. Further, problem (5) can be decoupled to optimize each $\alpha_i$ separately by minimizing $e^\top \xi_i + \mu \pi_i \beta^\top s_i$ subject to $\pi_i q_i - K \alpha_i \le \xi_i$, $K \alpha_i - \pi_i q_i \le \xi_i$, $-s_i \le \alpha_i \le s_i$. These $n$ subproblems are very small and hence scalable.

We also derive a greedy algorithm that sequentially selects $m$ experiments, based on the following result. Let the experiments $X$ be formed by two disjoint sets $X_1, X_2 \subset V$, and let $C_{f|X}$ denote the predictive covariance matrix $C_f$ given $X$. Then
$$\frac{1}{\nu^2} C_{f|X} = K^{(1)} - K^{(1)}_{v,x} \big(K^{(1)}_{x,x} + \mu I\big)^{-1} K^{(1)}_{x,v}, \qquad (7)$$
where $K^{(1)} = \frac{1}{\nu^2} C_{f|X_1}$ and $x$ indexes $X_2$. The sequential A-optimal algorithm repeats the following two steps until $m$ experiments have been selected: (1) select the $x \in V$ with the highest $\|k(x, \cdot)\|^2 / k(x, x)$ and add $x$ to $X$, where $k(x, \cdot)$ is the column of the current $K$ corresponding to $x$; (2) update $K \leftarrow K - K_{v,x}(K_{x,x} + \lambda I)^{-1} K_{x,v}$ based on the current $X$. As in other scenarios, this greedy approximation demonstrated high efficiency in our empirical study.
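The sequential algorithm above is simple to implement. Below is a minimal NumPy sketch of the two steps as stated; the variable names, the linear-kernel example, and the small jitter added to the diagonal are our choices, and since one point is added at a time the matrix inverse reduces to the scalar $K_{x,x} + \lambda$.

```python
import numpy as np

def sequential_transductive_design(K, m, lam=1.0):
    """Greedy sequential A-optimal transductive design (a sketch).

    K   : (n, n) kernel matrix over the candidate pool V
    m   : number of experiments to select
    lam : regularization parameter (lambda / mu in the text)
    """
    K = K.copy()
    selected = []
    for _ in range(m):
        # Step 1: pick x in V with the highest ||k(x, .)||^2 / k(x, x).
        scores = (K ** 2).sum(axis=0) / (np.diag(K) + 1e-12)
        scores[selected] = -np.inf            # never re-select a point
        x = int(np.argmax(scores))
        selected.append(x)
        # Step 2: update K <- K - K_{v,x} (K_{x,x} + lam)^{-1} K_{x,v}.
        kx = K[:, x:x + 1].copy()
        K -= kx @ kx.T / (K[x, x] + lam)
    return selected

# Example usage with a linear kernel K = V V^T.
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 5))
print(sequential_transductive_design(V @ V.T, m=4))
```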

4 Experiments

4.1 Toy Problem: Four Gaussians

We generated a toy problem with four Gaussian components in a 2-dimensional space, as shown in Fig. 1(a), and tested experiment design with $m = 4$. Classical experiment design attempts to reduce the predictive variance over the entire input space: in Fig. 1(b) the shaded contours produced by the selected experiments (the darker, the lower the variance) cover a large region of the input space where no data exist. In contrast, as shown in Fig. 1(c) and (d), the two variants of the A-optimal transductive design approach, Algorithms 3 and 2, concentrate on reducing the predictive variance at the targeted data and thus select points close to the cluster centers.

[Figure 1: Experimental design (m = 4) on a toy problem with four Gaussian components. Panels: (a) the data, (b) classical design, (c) sequential transductive design, (d) transductive design. The large red triangle markers indicate the selected data points; gray levels and contours indicate the predictive variance of the learnt function over the entire input space (darker means lower variance). Both transductive design methods give better results than the classical design.]

4.2 Text Categorization: Newsgroup and RCV1 Data Sets

In this subsection we validate the proposed experiment design approaches on supervised text categorization, using two data sets, the Newsgroup corpus and the RCV1 corpus. We solved two-class classification problems in a one-against-all scheme for each category. The data points selected by our transductive design, together with their labels (+1 or -1), are used by kernel ridge regression with a linear kernel; this learning model has shown state-of-the-art performance for text categorization. We also examined random sampling for linear ridge regression and active learning with SVMs. The SVM used in this study is the algorithm described in [4], which selects the data points closest to the decision boundary. We could not run classical A-optimal design on the text categorization problems, since our SDP solver does not scale to problems of this size.

The results are shown in Fig. 2. The two transductive experiment design methods consistently and significantly outperform the other methods on both data sets. For example, with just 10 selected training examples they achieved mean AUC scores of 90.2% on Newsgroup and 74.0% on RCV1, compared with 77.0% and 64.9% for random sampling. The error bars of the transductive designs are also much smaller than those of the compared methods. Furthermore, on both data sets the non-sequential transductive design outperforms the sequential one, confirming that the sequential greedy solution is less optimal than the non-sequential version; the advantage is particularly apparent on the Newsgroup data.

[Figure 2: Text categorization accuracy (AUC score) based on training data selected by different methods, on (left) the Newsgroup data set and (right) the RCV1 data set.]

Interestingly, active learning using SVMs performs worse than random sampling on Newsgroup, while it outperforms random sampling on RCV1. This can be explained by the fact that Newsgroup exhibits a clear clustering structure. SVM active learning tends to select data points near the classification boundary, which easily picks outliers when a strong cluster structure exists. In other words, SVM active learning may be unsuitable for data sets, like Newsgroup, that have a strong cluster structure.

References

[1] Atkinson, A. C. and Donev, A. N. Optimum Experimental Designs. Oxford Statistical Science Series. Oxford University Press, 1992.
[2] Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
[3] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[4] Tong, S. Active Learning: Theory and Applications. Ph.D. thesis, Stanford University, 2001.
[5] Yu, K., Bi, J., and Tresp, V. Active Learning via Transductive Experiment Design. Siemens AG, submitted.
[6] Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. 1-norm support vector machines. In S. Thrun, L. Saul, and B. Schölkopf, eds., Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
