Domain Adaptation Methods for Robust Pattern Recognition

David A. Shaw
Department of Mathematics, University of Maryland, College Park, MD

Rama Chellappa
Department of Electrical and Computer Engineering and UMIACS, University of Maryland, College Park, MD

Abstract: The large majority of classical and modern estimation techniques assume the data seen at the testing phase of statistical inference come from the same process that generated the training data. In many real-world applications this can be a restrictive assumption. We outline two solutions to overcome this heterogeneity: instance weighting and dimension reduction. Instance-weighting methods estimate weights to use in a loss function in an attempt to make the weighted training distribution look like the testing distribution, whereas dimension reduction methods seek transformations of the training and testing data that place them both into a latent space where their distributions will be similar. We use synthetic datasets and a real data example to test the methods against one another.

I. INTRODUCTION

In statistical and data analysis problems, we typically assume homogeneity of distributions between training and testing. Often, however, in practical situations this assumption is violated. This can result in poor predictive performance and present difficulties in determining optimal tuning parameters through cross-validation. For example, consider Figure 1. This dataset, from [1], is composed of both raw images and landmark points of individuals at various ages. In order to preprocess the landmark data, affine transformations are removed, resulting in predictors with a non-Euclidean structure [2], [3]. Often, however, testing data will be made available that has had some affine transformation applied to it (such as data obtained "in the wild"). It may be the case that only the preprocessed source data are available for an inference problem, but interest lies in estimating a predictive model that will perform well on all potential data.

We shall formalize this problem by operating under the assumption that the joint distribution between the response variable (Y) and covariates (x) changes from training to testing. In domain adaptation [4], we assume that while the covariate distribution might change between two domains, the underlying mechanism that generates the response variables from the covariates does not change: the conditional distribution of the response given the covariates remains the same across all domains, while the marginal distribution of the covariates may change. These assumptions are also present in covariate shift problems [5]. Transfer learning, another approach to this problem, deals with the case in which the conditional distribution of the response given the covariates shifts between domains while the covariate distribution stays the same [6].

We assume we have independent, identically distributed (i.i.d.) data x_1, ..., x_n, with x_i ~ x ∈ X ⊆ R^p and corresponding response variables y_1, ..., y_n, for training a model (called the source data), and z_1, ..., z_m, with z_j ~ z ∈ Z ⊆ R^q and responses ξ_1, ..., ξ_m, for testing (called the target data). Our main assumption is that the unknown response variables ξ_1, ..., ξ_m come from the same model that generated the known response variables (i.e., [Y | x] and [Ξ | z] follow the same law, though the marginal distributions of x and z differ in general). Our goal is to learn a parameterized conditional model optimal under (Ξ, z) while only knowing a small number of observations from Ξ (or knowing none in the unsupervised case).
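As a minimal synthetic illustration of this assumption (not taken from the paper), the sketch below draws source and target covariates from different marginal distributions while keeping the conditional mechanism fixed; all sizes, distributions, and names are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, m, p = 200, 300, 5

# Source covariates x ~ f and target covariates z ~ g: different marginals.
X = rng.normal(loc=0.0, scale=1.0, size=(n, p))
Z = rng.normal(loc=1.0, scale=2.0, size=(m, p))

# Shared conditional mechanism [Y | x] = [Xi | z]: same coefficients, same noise law.
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=0.5, size=n)      # observed source responses
xi = Z @ beta + rng.normal(scale=0.5, size=m)     # target responses, unknown at training time

# A model fit only to (X, y) is evaluated where it will actually be used: on (Z, xi).
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print("source MSE:", np.mean((X @ beta_hat - y) ** 2))
print("target MSE:", np.mean((Z @ beta_hat - xi) ** 2))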
Our data will typically consist of X = [x_1, ..., x_n]^T ∈ R^{n×p}, y = [y_1, ..., y_n]^T ∈ R^n, and Z = [z_1, ..., z_m]^T ∈ R^{m×q}. On occasion we will be blessed with some labeled examples from the target data, though all methods will be outlined for the unsupervised case, with the semisupervised case being a straightforward extension.

II. INSTANCE-WEIGHTING METHODS

Instance-weighting methods seek a set of weights to apply to the source data that will transform the distribution of the labeled source data into that of the unlabeled (or partially labeled) target data. In this way, a model transported from the labeled data in a different domain will yield better results. Shimodaira [5] proposed a method for correcting this discrepancy, with a view toward improving predictive performance, by weighting each element in the source data by an importance weight based on both the source and target density functions.

A. Covariate Shift

In covariate shift (CS, [5]), we assume the source data x ~ f and the target data z ~ g for some density functions f and g. If we operate under the previous assumption (namely, that [Y | x] and [Ξ | z] agree while the marginals of x and z differ), we fix a parameterization for the conditional distribution and define the Kullback-Leibler loss for the source data as

L_x(β) = ∫_{(Y,X)} f(x) q(y | x) log p(y | x, β) dy dx

and for the target data as

L_z(β) = ∫_{(Ξ,Z)} g(z) q(ξ | z) log p(ξ | z, β) dξ dz.

We shall assume for the time being that (Y, x) and (Ξ, z) have the same support, so that the loss functions L_x and L_z differ only through the marginal distributions f and g. Let

L_w^(n)(β; X, y) = Σ_{i=1}^n w(x_i) log p(y_i | x_i, β),

so that, for w(x) ≡ 1, L_w^(n)/n → L_x as n → ∞. Since we desire the loss for the target data z, we take w to be the importance weight w(x) = g(x)/f(x). In this case, L_w^(n)/n → L_z as n → ∞, which is what is needed.

This method is not always optimal. First, it typically will only yield improvements under a misspecification of the model (i.e., when p(y | x, β) differs from the true model), though for high-dimensional data even standard methods for misspecified models perform similarly to the importance-weighted method [7]. Additionally, density estimation is a nontrivial problem in high dimensions, which is often the case in problems of interest. Density estimates can still be made by, for example, using radial kernels (see Section IV), though accurate estimates of the density remain difficult in this setting. Even for one-dimensional problems, CS requires that supp(Z) ⊆ supp(X), a restriction that can hinder the types of problems we will consider.
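A minimal NumPy sketch of the weighting scheme just described is given below. It estimates f and g with Gaussian kernel density estimates using the rule-of-thumb bandwidths of Section IV, forms w(x_i) = ĝ(x_i)/f̂(x_i), and fits a weighted least-squares model as the Gaussian-likelihood special case of the weighted log-likelihood; all sample sizes, coefficients, and function names are illustrative rather than those used in the experiments.

import numpy as np

def gaussian_kde_logpdf(points, data, bandwidths):
    """Log of a product-Gaussian kernel density estimate with a diagonal bandwidth matrix."""
    # points: (k, p), data: (n, p), bandwidths: (p,)
    diff = (points[:, None, :] - data[None, :, :]) / bandwidths   # (k, n, p)
    log_kernels = -0.5 * np.sum(diff ** 2, axis=-1)               # (k, n)
    log_norm = np.sum(np.log(bandwidths)) + 0.5 * points.shape[1] * np.log(2 * np.pi)
    return np.logaddexp.reduce(log_kernels, axis=1) - np.log(data.shape[0]) - log_norm

def rule_of_thumb(data):
    """Per-coordinate bandwidth h_j = n^{-1/(p+4)} * sigma_j (the rule of thumb of Section IV)."""
    n, p = data.shape
    return n ** (-1.0 / (p + 4)) * data.std(axis=0, ddof=1)

rng = np.random.default_rng(0)
n, m, p = 200, 300, 3
X = rng.normal(0.0, 1.0, size=(n, p))
Z = rng.normal(0.5, 1.2, size=(m, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Importance weights w(x_i) = g_hat(x_i) / f_hat(x_i), evaluated at the source points.
log_f = gaussian_kde_logpdf(X, X, rule_of_thumb(X))
log_g = gaussian_kde_logpdf(X, Z, rule_of_thumb(Z))
w = np.exp(log_g - log_f)

# Weighted least squares: minimize sum_i w(x_i) (y_i - x_i^T beta)^2, the empirical
# counterpart of the weighted log-likelihood loss above under a Gaussian model.
sw = np.sqrt(w)
beta_w = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
print(beta_w)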

Fig. 1. Sample images from FG-NET: originals (top), landmark points (middle), and rotated/reduced landmark points (bottom).

B. Kernel Mean Matching

Kernel mean matching (KMM, [8]) seeks to generalize CS by incorporating the weights w into an objective function and optimizing over the data to find optimal weights instead of taking them as known. CS is further generalized by defining a kernel map Φ : X → F and assuming still that X = Z. An expectation operator µ is defined as µ(F) = E_F[Φ(x)], where F is the cdf corresponding to the density f (and similarly µ(G) for the target). The KMM procedure seeks to solve

minimize_w ‖µ(G) − E_F[w(x) Φ(x)]‖ subject to w(x) ≥ 0 and E_F[w(x)] = 1.

To find w ∈ R^n, we define an empirical version of the above objective function and incorporate the constraints w_i ∈ [0, W] and |Σ_{i=1}^n w_i − n| ≤ nε, analogous to the two constraints above. This empirical objective is written as

J(w) = (1/2) w^T K w − k^T w, where K_ij = k(x_i, x_j), k_i = (n/m) Σ_{j=1}^m k(x_i, z_j),

for an appropriate kernel function k(·, ·). The constrained minimization of J above under the empirical constraints can be solved by any quadratic programming method (e.g., an interior point method [9]). In [8], k is taken to be the Gaussian kernel k(u, v) = exp{−σ ‖u − v‖²} for a fixed σ. The choice of ε is governed by the following result: Huang et al. [8] showed that if w(x) ∈ [0, W] for all x ∈ X, then, given x_1, ..., x_n i.i.d. from F,

(1/n) Σ_{i=1}^n w(x_i) →_D N(µ_w, σ_w²),

where µ_w = ∫_X w(x) dF(x) and σ_w ≤ W/(2√n). This indicates that we should take ε = O(W/√n) for some fixed constant W.

KMM is able to overcome some of the drawbacks of CS. Since weights are estimated directly, density functions no longer need to be defined explicitly (or estimated), reducing potential errors, especially in multivariate problems. Moreover, the kernel approach allows for capturing possible nonlinearities that might improve predictive performance. Unfortunately, we still have the restriction that the source and target must have the same support, and in fact we still need supp(Z) ⊆ supp(X) as in CS. Additionally, since a kernel approach is used, weights are only defined at the training input points, so that if we desire weights for points that were not available initially (e.g., for cross-validation), we must rerun the entire optimization.
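The sketch below sets up the empirical KMM program, minimizing (1/2) w^T K w − k^T w subject to 0 ≤ w_i ≤ W and |Σ_i w_i − n| ≤ nε, and solves it with SciPy's SLSQP routine as a generic stand-in for the interior-point quadratic programming solver mentioned above; the kernel width σ, W, and ε below are illustrative values, not a recommendation.

import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * sq)

rng = np.random.default_rng(0)
n, m, p = 80, 120, 3
X = rng.normal(0.0, 1.0, size=(n, p))
Z = rng.normal(0.5, 1.2, size=(m, p))

sigma, W = 0.5, 1000.0
eps = 1.0 / np.sqrt(n)                    # epsilon on the order of n^{-1/2}, as in Section IV
K = gaussian_kernel(X, X, sigma)          # K_ij = k(x_i, x_j)
kappa = (n / m) * gaussian_kernel(X, Z, sigma).sum(axis=1)   # k_i = (n/m) sum_j k(x_i, z_j)

def objective(w):
    return 0.5 * w @ K @ w - kappa @ w

def grad(w):
    return K @ w - kappa

# |sum_i w_i - n| <= n*eps, written as two smooth linear inequalities.
constraints = [
    {"type": "ineq", "fun": lambda w: n * eps - (w.sum() - n)},
    {"type": "ineq", "fun": lambda w: n * eps + (w.sum() - n)},
]
bounds = [(0.0, W)] * n
res = minimize(objective, np.ones(n), jac=grad, bounds=bounds,
               constraints=constraints, method="SLSQP")
w_hat = res.x
print(w_hat[:5], w_hat.mean())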

C. Kullback-Leibler Importance Estimation

The Kullback-Leibler importance estimation procedure (KLIEP, [7]) attempts to improve on KMM by specifying a model for the weight function w(x), so that weights can be obtained for points not available at training time, and estimating it from the given data. We model

ĝ(x) = ŵ(x) f(x), ŵ(x) = Σ_{l=1}^b α_l φ_l(x),

where {α_l} are unknown parameters to be estimated and φ_l ≥ 0 are fixed basis functions. The weight function ŵ is chosen so as to minimize

D_KL[g(x) ‖ ĝ(x)] = ∫_X g(x) log (g(x)/f(x)) dx − ∫_X g(x) log ŵ(x) dx.

Since the second term is the only one involving the parameters of interest, we form the objective

J(α) = (1/m) Σ_{j=1}^m log ( Σ_{l=1}^b α_l φ_l(z_j) ).

J is concave in α, but it is unbounded above without constraints, so constraints are required to successfully optimize over α. First, we desire ŵ(x) ≥ 0, which we enforce as α_l ≥ 0 for l = 1, ..., b. Additionally, since ĝ(x) = ŵ(x) f(x) should be a proper density function, we have

1 = ∫_X ŵ(x) f(x) dx ≈ (1/n) Σ_{i=1}^n Σ_{l=1}^b α_l φ_l(x_i).

In practice, we define the basis functions using the target data. Here,

ŵ(x) = Σ_{j=1}^m α_j k(x, z_j),

where k(·, ·) is the Gaussian kernel as before. The estimation of α is carried out through gradient ascent on J above. We define K_ij = k(z_i, z_j) and k_j = (1/n) Σ_{i=1}^n k(x_i, z_j), as in KMM except with the target data in place of the source data. The parameter update then uses gradient ascent with K and constraint satisfaction using k to ensure the α are properly normalized. It can be shown that, when the Gaussian kernel is used at the test input points to estimate the weight function, KLIEP converges to the optimal solution at a rate slightly slower than O_p(n^{−1/2}), assuming n = m.

KLIEP improves over CS and KMM in obtaining a general weight function that can be applied to points that are not available at training time. The method can be computationally intensive, especially when a large number of target samples are available; in this case, subsampling or clustering can be used to reduce the computational cost of estimating the basis functions. KLIEP has similar issues to CS and KMM in that we still require X = Z, and we further require F and G to be mutually absolutely continuous.
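A minimal sketch of this procedure, with Gaussian basis functions centred at the target points and a projected gradient ascent that alternates gradient steps with the nonnegativity and normalization constraints, is given below; the kernel width, step size, and iteration count are arbitrary choices rather than tuned values.

import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, m, p = 200, 150, 3
X = rng.normal(0.0, 1.0, size=(n, p))
Z = rng.normal(0.5, 1.2, size=(m, p))

sigma = 1.0
Phi_Z = gaussian_kernel(Z, Z, sigma)                  # phi_l(z_j) with basis centred at target points
phi_bar = gaussian_kernel(X, Z, sigma).mean(axis=0)   # (1/n) sum_i phi_l(x_i), used for normalization

alpha = np.ones(m) / m
step = 1e-3
for _ in range(2000):
    # Gradient of J(alpha) = (1/m) sum_j log( sum_l alpha_l phi_l(z_j) )
    denom = Phi_Z @ alpha
    grad = Phi_Z.T @ (1.0 / np.maximum(denom, 1e-12)) / m
    alpha = alpha + step * grad
    # Constraint satisfaction: alpha >= 0 and (1/n) sum_i w_hat(x_i) = 1.
    alpha = np.maximum(alpha, 0.0)
    alpha = alpha / max(phi_bar @ alpha, 1e-12)

def w_hat(points):
    """Estimated weight function, defined for arbitrary points (unlike CS and KMM)."""
    return gaussian_kernel(points, Z, sigma) @ alpha

print(w_hat(X)[:5])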
III. DIMENSION REDUCTION METHODS

The importance-weighting methods share similar drawbacks, namely that they have difficulty handling cases in which the source and target data not only come from different distributions but perhaps even have differing underlying structures, so that X ≠ Z or supp(Z) is not contained in supp(X). In this case, transformations of the source or target data (or, typically, both) are desired in order to obtain a representation that is, ideally, invariant to domain changes.

Many dimension reduction methods require knowledge of the structure of the Grassmannian G(p, d). We think of this space intuitively as the set of all d-dimensional subspaces of R^p, i.e., the quotient space

G(p, d) = R(p, d)/∼,

where R(p, d) is the space of all p × d matrices of rank d and, for U, V ∈ R^{p×d}, U ∼ V if there exists a nonsingular L ∈ R^{d×d} such that V = U L [10]. Elements of the Grassmannian are often identified by bases for their particular subspaces, e.g., by η ∈ R^{p×d} with η^T η = I_d.

A. Intermediate Subspaces

The intermediate subspace approach (IS, [11]) seeks a latent feature representation by obtaining intermediate feature spaces that help to quantify the shift from the source space to the target space. In this method, the latent variables are obtained by sampling points along a geodesic on the Grassmannian between the subspace spanned by the source dataset and the subspace spanned by the target dataset. It can be written as follows: using principal component analysis, learn d-dimensional (d < p = q) representations of the data X and Z as X̃ = X η_x and Z̃ = Z η_z for η_x, η_z ∈ G(p, d); then, using the geodesic along G(p, d) between η_x and η_z, obtain intermediate transformations η_1, ..., η_K to use in finding representations X̃, X̃_1, ..., X̃_{K+1} and Z̃, Ẑ_K, ..., Ẑ_0, where X̃_{K+1} = X η_z and Ẑ_0 = Z η_x. The newly acquired latent feature representation for the source data X is then the concatenation of these matrices,

X* = [ X̃  X̃_1  ···  X̃_{K+1} ],

with a similar representation acquired for the data Z. Partial least squares [12], hereafter called PLS, is then performed to obtain a low-dimensional model operating on these expanded datasets. Some drawbacks of the IS method are its reliance on a large number of tuning parameters and the high dimensionality that must be overcome when many subspaces are desired.
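A sketch of the subspace-sampling step is given below, assuming the geodesic is parameterized through the principal angles between the two PCA subspaces; the PLS step and the choice of K are omitted, and the function names, sizes, and the particular geodesic construction are my own choices rather than the implementation of [11].

import numpy as np

def pca_basis(A, d):
    """Orthonormal basis of the top-d principal subspace (a point on G(p, d))."""
    A = A - A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:d].T                                   # (p, d), columns orthonormal

def grassmann_geodesic(eta_x, eta_z, t):
    """Basis of the subspace at position t in [0, 1] on the geodesic from span(eta_x) to span(eta_z)."""
    U, s, Vt = np.linalg.svd(eta_x.T @ eta_z)
    theta = np.arccos(np.clip(s, -1.0, 1.0))          # principal angles between the subspaces
    # Orthonormal directions, orthogonal to span(eta_x), along which the subspace rotates.
    W = eta_z @ Vt.T - eta_x @ U @ np.diag(s)
    sin_t = np.where(np.sin(theta) > 1e-12, np.sin(theta), 1.0)
    Q = W / sin_t
    return eta_x @ U @ np.diag(np.cos(t * theta)) + Q @ np.diag(np.sin(t * theta))

rng = np.random.default_rng(0)
n, m, p, d, K = 200, 150, 10, 3, 8
X = rng.normal(size=(n, p))
Z = rng.normal(0.5, 1.5, size=(m, p))

eta_x, eta_z = pca_basis(X, d), pca_basis(Z, d)
ts = np.linspace(0.0, 1.0, K + 2)                     # endpoints give X @ eta_x and X @ eta_z
X_star = np.hstack([X @ grassmann_geodesic(eta_x, eta_z, t) for t in ts])
Z_star = np.hstack([Z @ grassmann_geodesic(eta_x, eta_z, t) for t in ts])
print(X_star.shape, Z_star.shape)                     # expanded (K+2)*d-dimensional representations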

B. Geodesic Flow Kernel

The geodesic flow kernel (GFK, [13]) seeks improvements over the IS method. GFK removes the need for sampling along the geodesic between the source and target subspaces and uses a kernel approach to mitigate the extreme dimensionality of IS when a large number of subspaces is used. As before, we have η_x, η_z ∈ R^{p×d} as bases for the source and target subspaces, respectively, and we write their orthogonal complements as η_x^⊥, η_z^⊥ ∈ R^{p×(p−d)}. We write the geodesic flow γ : [0, 1] → G(p, d) as

γ(t) = η_x U_1 Γ(t) − η_x^⊥ U_2 Σ(t),

where U_1, U_2, Γ, Σ are given by the generalized singular value decomposition

η_x^T η_z = U_1 Γ V^T, (η_x^⊥)^T η_z = −U_2 Σ V^T.

We define Γ(t) and Σ(t) as diagonal matrices with cos(tθ_i) and sin(tθ_i) on the diagonal, where Γ and Σ are diagonal matrices with cos(θ_i) and sin(θ_i) on the diagonal for i = 1, ..., d. Our goal is to use all t ∈ (0, 1) to obtain representations γ(t)^T x for x in the source domain. Computationally this is infeasible, so we proceed through a kernel approach, where

⟨Φ(u), Φ(v)⟩ = ∫_0^1 (γ(t)^T u)^T (γ(t)^T v) dt = u^T G v,

with G ∈ R^{p×p} positive semidefinite and defined through matrices obtained from the generalized singular value decomposition above. This kernel is then used to perform prediction through kernel nearest neighbor. A potential drawback of both IS and GFK is that the geodesic path between two subspaces may not be the most informative path, especially when further labeling information is available. Moreover, all of the issues with kernel methods outlined previously hold for GFK as well.
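Rather than the closed-form expression for G obtained from the generalized singular value decomposition in [13], the sketch below approximates G = ∫_0^1 γ(t) γ(t)^T dt numerically on a grid of t values; it repeats the geodesic construction from the previous sketch so that it is self-contained, and all names and sizes are again hypothetical.

import numpy as np

def pca_basis(A, d):
    A = A - A.mean(axis=0)
    return np.linalg.svd(A, full_matrices=False)[2][:d].T

def grassmann_geodesic(eta_x, eta_z, t):
    U, s, Vt = np.linalg.svd(eta_x.T @ eta_z)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    W = eta_z @ Vt.T - eta_x @ U @ np.diag(s)
    Q = W / np.where(np.sin(theta) > 1e-12, np.sin(theta), 1.0)
    return eta_x @ U @ np.diag(np.cos(t * theta)) + Q @ np.diag(np.sin(t * theta))

def gfk_matrix(eta_x, eta_z, n_grid=200):
    """Approximate G = int_0^1 gamma(t) gamma(t)^T dt by averaging over a grid of t values."""
    p = eta_x.shape[0]
    G = np.zeros((p, p))
    for t in np.linspace(0.0, 1.0, n_grid):
        gamma_t = grassmann_geodesic(eta_x, eta_z, t)
        G += gamma_t @ gamma_t.T
    return G / n_grid

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z = rng.normal(0.5, 1.5, size=(150, 10))
eta_x, eta_z = pca_basis(X, 3), pca_basis(Z, 3)

G = gfk_matrix(eta_x, eta_z)
# The induced kernel is <Phi(u), Phi(v)> = u^T G v; a kernel nearest-neighbour rule can use
# the squared distance (u - v)^T G (u - v).
u, v = X[0], Z[0]
print(u @ G @ v, (u - v) @ G @ (u - v))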
C. Regularized Likelihood Directions

Regularized likelihood directions (RLD, [14]) approaches the domain adaptation problem through sufficient dimension reduction [15]: we seek a transformation in which η is estimated so that P(Y ∈ A | η^T x) = P(Y ∈ A | x) for any set A. Many of the previous dimension reduction approaches can be reinterpreted as enforcing normality constraints on the source and target data [16]. Here, we make the slightly more relaxed assumption that Y takes values in {1, ..., C} and that x | y ~ N(µ_{x|y}, Σ_{x|y}), y = 1, ..., C, where the n_y observations x | y come from X with response variable y. We write cov(x) = Σ_x. Likelihood-acquired directions (LAD, [17]) seeks to maximize

L(η) = (1/2) log |η^T Σ_x η| − (1/2) Σ_{y=1}^C (n_y/n) log |η^T Σ_{x|y} η|

over all η such that η^T η = I_d, where L(η) is proportional to a likelihood function. This is done through conjugate gradient ascent on G(p, d), as described in [18]. RLD proceeds in a similar fashion to KMM in that it attempts to keep the means of the source and target distributions close, this time within class. To do this, we define

Γ_λ(η, γ) = −(λ/2) Σ_{y=1}^C 1{l_y ≥ 1} ‖η^T µ_{x|y} − γ^T µ_{z|y}‖²,

where l_y denotes the number of labeled target observations available in class y, for a fixed λ > 0, as a regularization term to be included in the overall maximum likelihood framework. Here we let γ ∈ R^{q×d}, as we do not necessarily have p = q. However, if p = q, letting γ = η often results in improvements, because we estimate a smaller number of parameters from the same data. Since we often have only a small number of realizations from z, second moments are not constrained. Our goal then becomes to solve

arg max_{(η, γ)} L(η) + Γ_λ(η, γ).

While this method is defined as requiring within-class means from the target distribution, we can adapt it to match only global means in the unsupervised case.
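The sketch below evaluates L(η) and Γ_λ(η, γ) and maximizes their sum with a crude numerical-gradient ascent followed by QR re-orthonormalization; this is only a stand-in for the Riemannian conjugate gradient (sgmin, [22]) used in the paper, and all sizes, step sizes, and function names are illustrative.

import numpy as np

def lad_likelihood(eta, X, y):
    """L(eta) = 1/2 log|eta' Sigma_x eta| - 1/2 sum_y (n_y/n) log|eta' Sigma_{x|y} eta|."""
    n = X.shape[0]
    Sigma = np.cov(X, rowvar=False)
    val = 0.5 * np.linalg.slogdet(eta.T @ Sigma @ eta)[1]
    for c in np.unique(y):
        Xc = X[y == c]
        Sigma_c = np.cov(Xc, rowvar=False)
        val -= 0.5 * (len(Xc) / n) * np.linalg.slogdet(eta.T @ Sigma_c @ eta)[1]
    return val

def mean_penalty(eta, gamma, X, y, Z, yz, lam):
    """Gamma_lambda: within-class mean matching over classes with labelled target data."""
    val = 0.0
    for c in np.unique(yz):
        mu_x = X[y == c].mean(axis=0)
        mu_z = Z[yz == c].mean(axis=0)
        val -= 0.5 * lam * np.sum((eta.T @ mu_x - gamma.T @ mu_z) ** 2)
    return val

def objective(eta, gamma, X, y, Z, yz, lam):
    return lad_likelihood(eta, X, y) + mean_penalty(eta, gamma, X, y, Z, yz, lam)

def fit_rld(X, y, Z, yz, d, lam=1.0, steps=200, lr=1e-2, h=1e-5):
    rng = np.random.default_rng(0)
    p, q = X.shape[1], Z.shape[1]
    eta = np.linalg.qr(rng.normal(size=(p, d)))[0]
    gamma = np.linalg.qr(rng.normal(size=(q, d)))[0]
    for _ in range(steps):
        for M, shape in ((eta, (p, d)), (gamma, (q, d))):
            grad = np.zeros(shape)
            base = objective(eta, gamma, X, y, Z, yz, lam)
            for idx in np.ndindex(shape):
                M[idx] += h
                grad[idx] = (objective(eta, gamma, X, y, Z, yz, lam) - base) / h
                M[idx] -= h
            M += lr * grad                       # ascent step on the current matrix, in place
        eta = np.linalg.qr(eta)[0]               # retract back to orthonormal bases
        gamma = np.linalg.qr(gamma)[0]
    return eta, gamma

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6)); y = rng.integers(0, 3, 150)
Z = rng.normal(1.0, 1.5, size=(60, 4)); yz = rng.integers(0, 3, 60)
eta, gamma = fit_rld(X, y, Z, yz, d=2, lam=1.0, steps=50)
print(eta.shape, gamma.shape)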

IV. EXPERIMENTS

We test the outlined methods against one another (as well as against a baseline method) using synthetic and real datasets. The real data considered is the age estimation problem shown in Figure 1, while the synthetic dataset is outlined below. The baseline method is simply a least squares classifier using PCA with no adaptation. Both unsupervised and semisupervised problems are considered. All methods were outlined for the unsupervised case; in the semisupervised case, the data used for estimating the model is augmented with the given labeled target data, which for all methods is also included as the target data used to estimate the weights or transformation.

For CS, we learn the density functions using a radial kernel density estimation procedure with a Gaussian kernel, i.e.,

f̂_n(x) = (1/(n|H|)) Σ_{i=1}^n K[H^{−1}(x − x_i)], K(u) ∝ exp(−½ ‖u‖²),

where H is a diagonal bandwidth matrix with each diagonal element chosen according to the rule of thumb ĥ_j = n^{−1/(p+4)} σ̂_j, with σ̂_j the estimated standard deviation of column j of X [19]. A similar density estimate is obtained for g. We choose a Gaussian kernel over a more efficient kernel (e.g., Epanechnikov) due to its infinite support. For KMM, we take ε = 1/n^{1/2} and W = 1000 as in [8], but set σ = 1e−4, as it yielded better results in practice than the authors' suggestion of σ = 1e−1. For KLIEP, we take ε = 1e−3 and σ = 1e−1. For IS, we set the number of intermediate subspaces to 8, and for RLD we set λ = 1 for the simulation and λ = 0.2 for the real-world data. These tuning parameters were found to give the best results. More intelligent parameter selection can be done, e.g., as in [20] for regularization parameters or [21] for bandwidths. Other types of cross-validation can be performed for all methods, though for many of them this entails a large amount of computational expense. For RLD, we use the sgmin implementation provided by Lippert and Edelman [22]. To easily incorporate weights and to reduce computational complexity, a one-vs-all least squares classifier is estimated for all methods except GFK, which uses kernel nearest neighbor. For the baseline and instance-weighting methods, PCA is first performed to reduce the effect of the high dimensionality. For the case where p ≠ q, features from the higher-dimensional space are transformed into the lower-dimensional space using PCA for all methods except RLD.

A. Simulation Studies

For the simulation studies, we generate 200 observations in R^6 for the source data and 300 observations in R^4 for the target data, in three classes. We generate the source data as multivariate normal with zero mean and covariance matrix

Σ_ij = 0.5^{|i−j|} for i, j = 1, ..., 6. (1)

The target data is generated as an equal-weight mixture of two normals, one with a mean vector of ones and covariance matrix

Σ_ij = 0.5^{|i−j|} for i, j = 1, ..., 4, (2)

the other with a mean vector of negative ones and a covariance matrix of the same form,

Σ_ij = ρ^{|i−j|} for i, j = 1, ..., 4. (3)

We generate both η and γ with random normal entries in R^{6×2} and R^{4×2}, respectively, and take their orthogonalizations. The labels are generated as

y = η_1^T x − η_2^T x + ε, (4)

where ε ~ N(0, 0.5²). The response values y are then discretized into three categories by thresholding them at their one-third and two-thirds quantiles. Similar labels are generated for the target data using γ.
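A sketch of this data-generation recipe follows. The correlation parameter ρ of the second mixture component and the exact combination of the two projections in (4) did not survive the transcription, so the values used below (rho2 = 0.9 and a difference of the two projected scores) are placeholders and should not be read as the settings used in the paper.

import numpy as np

def ar1_cov(dim, rho):
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])    # Sigma_ij = rho^{|i-j|}

def discretize(scores):
    """Threshold continuous responses at their 1/3 and 2/3 quantiles to get three classes."""
    q1, q2 = np.quantile(scores, [1 / 3, 2 / 3])
    return np.digitize(scores, [q1, q2])

rng = np.random.default_rng(0)
n, m, p, q, d = 200, 300, 6, 4, 2

# Source: N(0, Sigma) with Sigma_ij = 0.5^{|i-j|}.
X = rng.multivariate_normal(np.zeros(p), ar1_cov(p, 0.5), size=n)

# Target: equal-weight mixture of N(+1, ar1_cov(4, 0.5)) and N(-1, ar1_cov(4, rho2));
# rho2 is a placeholder value, not the one used in the paper.
rho2 = 0.9
comp = rng.integers(0, 2, size=m)
Z = np.where(comp[:, None] == 0,
             rng.multivariate_normal(np.ones(q), ar1_cov(q, 0.5), size=m),
             rng.multivariate_normal(-np.ones(q), ar1_cov(q, rho2), size=m))

# Random orthonormal directions eta (6 x 2) and gamma (4 x 2).
eta = np.linalg.qr(rng.normal(size=(p, d)))[0]
gamma = np.linalg.qr(rng.normal(size=(q, d)))[0]

# Continuous responses from a linear model in the projected covariates (eq. (4), as
# reconstructed here), then discretized into three classes.
y = discretize(X @ eta[:, 0] - X @ eta[:, 1] + rng.normal(scale=0.5, size=n))
y_z = discretize(Z @ gamma[:, 0] - Z @ gamma[:, 1] + rng.normal(scale=0.5, size=m))
print(np.bincount(y), np.bincount(y_z))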
Each study is run ten times, and the average misclassification rate is recorded for all methods under consideration, along with the standard deviation. For the unsupervised case, we use half of the source data and half of the target data to build the model, while testing the model on the remaining half of both the source and target data. For the semisupervised studies, we build the model again on half of the source data, this time selecting three observations per class from the target data to use in parameter estimation. We then test this model on the remaining half of the source data and all 300 of the target data points.

Misclassification rates are given in Table I with standard deviations given in parentheses. No method is uniformly best on both the source and the target. RLD looks to be the clear winner in all cases except semisupervised same-dimension on the source data and unsupervised differing-dimension on the target data. In the first of these exceptions all methods seem to perform equally well, with KLIEP performing best; in the second, again all methods yield similar results, this time with KMM coming out ahead. The good performance of RLD is partly due to the fact that the underlying model was linear in its arguments; nonlinear models require more complicated techniques [14]. Note that most methods saw performance drops when moving from source to target, with most of the top methods performing better than the baseline on the target data, an indication that accounting for the domain shift is a worthwhile approach.

TABLE I
MISCLASSIFICATION RATES FROM SIMULATION STUDIES. STANDARD DEVIATIONS ARE GIVEN IN PARENTHESES.

                      Unsup/Source   Unsup/Target   Semisup/Source   Semisup/Target
Same dimension
  Baseline            (7.48)         (21.47)        (9.02)           (30.14)
  CS                  (6.65)         (21.83)        (11.58)          (28.56)
  KMM                 (6.62)         (22.04)        (14.15)          (25.65)
  KLIEP               (7.55)         (21.43)        (8.97)           (30.25)
  IS                  (8.19)         (18.24)        (9.73)           (30.69)
  GFK                 (9.96)         (14.60)        (12.58)          (23.55)
  RLD                 (3.88)         (17.61)        (10.72)          (14.91)
Different dimension
  Baseline            (8.52)         (29.26)        (6.15)           (20.85)
  CS                  (8.76)         (29.78)        (9.01)           (20.03)
  KMM                 (8.71)         (28.85)        (8.21)           (20.35)
  KLIEP               (8.24)         (29.34)        (6.21)           (20.94)
  IS                  (10.15)        (27.42)        (8.06)           (12.51)
  GFK                 (13.00)        (16.31)        (8.74)           (8.20)
  RLD                 (4.02)         (28.28)        (4.38)           (13.09)
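A sketch of one unsupervised repetition of this protocol, in the same-dimension setting for simplicity, is given below: PCA features, a (possibly weighted) one-vs-all least squares classifier, and the misclassification rates on the held-out source and target halves. The estimate_weights argument is a placeholder for any of the instance-weighting estimators; the semisupervised augmentation and the p ≠ q handling are omitted.

import numpy as np

def pca_transform(X_ref, d):
    mu = X_ref.mean(axis=0)
    V = np.linalg.svd(X_ref - mu, full_matrices=False)[2][:d].T
    return lambda A: (A - mu) @ V

def one_vs_all_lsq(F, y, n_classes, w=None):
    """Weighted one-vs-all least squares: one linear score per class, predict the argmax."""
    w = np.ones(len(y)) if w is None else w
    sw = np.sqrt(w)[:, None]
    F1 = np.hstack([F, np.ones((len(F), 1))])                 # add an intercept column
    Y = np.eye(n_classes)[y] * 2 - 1                          # +1 / -1 targets per class
    B = np.linalg.lstsq(F1 * sw, Y * sw, rcond=None)[0]
    return lambda A: np.argmax(np.hstack([A, np.ones((len(A), 1))]) @ B, axis=1)

def misclassification_rate(predict, F, y):
    return np.mean(predict(F) != y)

def run_once(X, y, Z, y_z, d, estimate_weights=None, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    i, j = rng.permutation(len(X)), rng.permutation(len(Z))
    tr_s, te_s = i[: len(X) // 2], i[len(X) // 2 :]
    tr_t, te_t = j[: len(Z) // 2], j[len(Z) // 2 :]
    to_feat = pca_transform(X[tr_s], d)
    w = estimate_weights(X[tr_s], Z[tr_t]) if estimate_weights else None
    predict = one_vs_all_lsq(to_feat(X[tr_s]), y[tr_s], n_classes=3, w=w)
    return (misclassification_rate(predict, to_feat(X[te_s]), y[te_s]),
            misclassification_rate(predict, to_feat(Z[te_t]), y_z[te_t]))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6)); y = rng.integers(0, 3, 200)
Z = rng.normal(0.5, 1.3, size=(300, 6)); y_z = rng.integers(0, 3, 300)
errs = [run_once(X, y, Z, y_z, d=2, rng=np.random.default_rng(k)) for k in range(10)]
print(np.mean(errs, axis=0), np.std(errs, axis=0))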

B. Real-World Data

We consider an age estimation problem from the data at [1], described in the Introduction (see Figure 1). For all methods we first apply the inverse exponential map. This map sends the predictors under consideration to the tangent space about the manifold on which these predictors lie, which is an ordinary Euclidean space; see [23] for numerical methods to compute this map. All results are given for the semisupervised case. The source data are the original landmark points, while the target data consist of source landmark points rotated by a randomly sampled angle between 0 and π/4. We sample with replacement 250 observations per class from the source data and 50 observations per class from the target data to build the model, while testing on the remaining source and target data. Here, classes are obtained by thresholding the continuous response variables at their median (obtained from the labeled source data). This categorical information is then used to estimate transformations in the RLD framework. The square root of the age variable is used, as this guarantees nonzero estimates and has been shown to improve performance [16]. As in the simulations, the sampling is done ten times, with the average mean absolute error recorded, along with the standard deviation.

Results are given in Table II, where the estimated dimension is taken to be both 8 and 16.

TABLE II
MEAN ABSOLUTE ERRORS FROM AGE ESTIMATION. STANDARD DEVIATIONS ARE GIVEN IN PARENTHESES.

                      d = 8/Source   d = 8/Target   d = 16/Source   d = 16/Target
Same dimension
  Baseline            7.35 (0.18)         (1.28)    7.06 (0.31)          (1.41)
  CS                       (0.81)         (0.71)         (0.51)          (0.45)
  KMM                 7.35 (0.18)         (1.28)    7.06 (0.31)          (1.41)
  KLIEP               7.33 (0.18)         (1.28)    7.04 (0.31)          (1.41)
  IS                  7.35 (0.24)    7.55 (0.25)    6.86 (0.33)     6.88 (0.20)
  GFK                      (0.21)    7.72 (0.91)         (0.29)     7.07 (0.52)
  RLD                 6.91 (0.20)    7.38 (0.44)    6.78 (0.23)     7.07 (0.22)
Different dimension
  Baseline            7.24 (0.22)         (1.31)    6.95 (0.37)     9.72 (1.68)
  CS                       (0.64)         (0.50)         (0.67)          (0.54)
  KMM                 7.24 (0.22)         (1.31)    6.95 (0.37)     9.72 (1.68)
  KLIEP               7.23 (0.22)         (1.30)    6.94 (0.36)     9.69 (1.67)
  IS                  7.15 (0.17)         (0.77)    6.91 (0.27)     9.57 (0.64)
  GFK                      (0.45)         (1.63)         (0.30)          (1.57)
  RLD                 7.10 (0.19)    9.44 (1.06)    7.05 (0.43)     8.45 (0.70)

In the same-dimension case, the dimension reduction methods come out ahead, being the only methods to obtain significantly better results than the baseline method. A similar scenario occurs in the different-dimension case, in which we remove half of the landmark points at random; however, here the only method to perform significantly better than the baseline is RLD. GFK performs much worse in this case, mainly because age estimates are obtained using a nearest neighbor method as opposed to a regression model. In all cases, CS performs poorly. Improvements to this method might be possible if more advanced, manifold-valued density estimates were considered on the original data as opposed to the transformed data that was used.
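A sketch of the target-domain construction and response preprocessing described above is given below, with random arrays standing in for the FG-NET landmarks and ages; the inverse exponential map of [23], the per-class resampling, and the classifiers themselves are omitted, and the error function shown is only one plausible way of scoring predictions made on the square-root scale.

import numpy as np

def rotate_landmarks(P, angle):
    """Apply a 2-D rotation to an (L, 2) array of landmark points."""
    c, s = np.cos(angle), np.sin(angle)
    return P @ np.array([[c, -s], [s, c]]).T

rng = np.random.default_rng(0)
n_subjects, n_landmarks = 500, 68
source = rng.normal(size=(n_subjects, n_landmarks, 2))        # stand-in for FG-NET landmark data
ages = rng.uniform(0, 69, size=n_subjects)                    # stand-in for the age labels

# Target domain: the same landmarks rotated by a random angle in [0, pi/4].
angles = rng.uniform(0, np.pi / 4, size=n_subjects)
target = np.stack([rotate_landmarks(P, a) for P, a in zip(source, angles)])

# Work with sqrt(age) as the response, and threshold at the source median to obtain
# the two classes used to estimate the RLD transformation.
resp = np.sqrt(ages)
classes = (resp > np.median(resp)).astype(int)

# Flatten landmark configurations into feature vectors (the paper additionally maps them
# to a tangent space via the inverse exponential map, which is omitted here).
X = source.reshape(n_subjects, -1)
Z = target.reshape(n_subjects, -1)

def mean_absolute_error(pred_sqrt_age, true_age):
    return np.mean(np.abs(pred_sqrt_age ** 2 - true_age))

pred = np.full(n_subjects, resp.mean())                       # trivial constant predictor
print(X.shape, Z.shape, np.bincount(classes), mean_absolute_error(pred, ages))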
V. CONCLUSION

When the distribution of features is heterogeneous across training and testing sets, modifications can be made to aid in prediction. We outlined two different paradigms: instance-weighting methods, in which each observation is assigned a weight to even out distributional differences, and dimension reduction methods, in which feature spaces are sought that minimize the discrepancy between training and testing sets. There have been many approaches within each of these frameworks. RLD works well when the underlying model is linear, and seems to perform best overall on a structured regression problem, with IS and GFK performing similarly well, at least when source and target dimensions are the same. In any case, using some method to account for a domain shift can go a long way toward improving predictive performance in pattern recognition tasks.

ACKNOWLEDGMENT

This work was partially supported by a MURI grant from the Office of Naval Research.

REFERENCES

[1] FG-NET Aging Database, Face and Gesture Recognition Research Network, accessed Apr. 2011.
[2] C. Goodall and K. Mardia, "Projective Shape Analysis," Journal of Computational and Graphical Statistics, vol. 8, no. 2.
[3] P. Turaga, S. Biswas, and R. Chellappa, "The Role of Geometry for Age Estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010.
[4] H. Daumé III and D. Marcu, "Domain adaptation for statistical classifiers," Journal of Artificial Intelligence Research, vol. 26, no. 1, 2006.
[5] H. Shimodaira, "Improving Predictive Inference Under Covariate Shift by Weighting the Log-Likelihood Function," Journal of Statistical Planning and Inference, vol. 90, no. 2, Oct. 2000.
[6] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, 2010.
[7] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe, "Direct importance estimation for covariate shift adaptation," Annals of the Institute of Statistical Mathematics, vol. 60, no. 4, 2008.
[8] J. Huang, A. J. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf, "Correcting sample selection bias by unlabeled data," Advances in Neural Information Processing Systems, vol. 19.
[9] Y. Nesterov and A. Nemirovskii, Interior-Point Polynomial Algorithms in Convex Programming. Philadelphia, PA: SIAM Studies in Applied and Numerical Mathematics, 1994.
[10] Y. Chikuse, Statistics on Special Manifolds, ser. Lecture Notes in Statistics. Springer.
[11] R. Gopalan, R. Li, and R. Chellappa, "Domain Adaptation for Object Recognition: An Unsupervised Approach," International Conference on Computer Vision, 2011.

[12] H. Wold, "Partial Least Squares," in Encyclopedia of Statistical Sciences, S. Kotz and N. Johnson, Eds. New York: Wiley.
[13] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[14] D. A. Shaw and R. Chellappa, "Sufficient dimension reduction for domain adaptation," 2014, under review.
[15] R. Cook and K. Adragni, "Sufficient dimension reduction and prediction in regression," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 367, no. 1906, 2009.
[16] D. A. Shaw and R. Chellappa, "A Dimension Reduction Method for Improving Predictive Performance in Regression and Classification," Journal of the American Statistical Association, 2014, under review.
[17] R. D. Cook and L. Forzani, "Likelihood-based sufficient dimension reduction," Journal of the American Statistical Association, vol. 104, no. 485, 2009.
[18] A. Edelman, T. Arias, and S. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, 1998.
[19] D. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. New York, Chichester: John Wiley & Sons.
[20] G. H. Golub, M. Heath, and G. Wahba, "Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter," Technometrics, vol. 21, no. 2, May 1979.
[21] C. R. Loader, "Bandwidth Selection: Classical or Plug-In?" The Annals of Statistics, vol. 27, no. 2, 1999.
[22] R. Lippert and A. Edelman, SGMIN, bai/et/other methods/overview SGMIN.html.
[23] K. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren, "Efficient algorithms for inferences on Grassmann manifolds," in Proceedings of the 12th IEEE Workshop on Statistical Signal Processing, 2003.


More information

CS 7140: Advanced Machine Learning

CS 7140: Advanced Machine Learning Instructor CS 714: Advanced Machine Learning Lecture 3: Gaussian Processes (17 Jan, 218) Jan-Willem van de Meent (j.vandemeent@northeastern.edu) Scribes Mo Han (han.m@husky.neu.edu) Guillem Reus Muns (reusmuns.g@husky.neu.edu)

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning Hilary Term 2018 Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information