Domain Adaptation Methods for Robust Pattern Recognition

David A. Shaw
Department of Mathematics, University of Maryland, College Park, MD

Rama Chellappa
Department of Electrical and Computer Engineering and UMIACS, University of Maryland, College Park, MD

Abstract: The large majority of classical and modern estimation techniques assume the data seen at the testing phase of statistical inference come from the same process that generated the training data. In many real-world applications this can be a restrictive assumption. We outline two solutions to overcome this heterogeneity: instance weighting and dimension reduction. Instance-weighting methods estimate weights to use in a loss function in an attempt to make the weighted training distribution look like the testing distribution, whereas dimension reduction methods seek transformations of the training and testing data that place them both into a latent space where their distributions will be similar. We use synthetic datasets and a real data example to test the methods against one another.

I. INTRODUCTION

In statistical and data analysis problems, we typically assume homogeneity of distributions between training and testing. Often, however, in practical situations this assumption is violated. This can result in poor predictive performance and present difficulties in determining optimal tuning parameters through cross-validation. For example, consider Figure 1. This dataset, from [1], is composed of both raw images and landmark points of individuals at various ages. In order to preprocess the landmark data, affine transformations are removed, resulting in predictors with a non-Euclidean structure [2], [3]. Often, however, testing data will be made available that has had some affine transformation applied to it (such as data obtained "in the wild"). It may be the case that only the preprocessed source data are available for an inference problem, but interest lies in estimating a predictive model that will perform well on all potential data.

We shall formalize this problem by operating under the assumption that the joint distribution between the response variable (Y) and covariates (x) changes from training to testing. In domain adaptation [4], we assume that while the covariate distribution might change between two domains, the underlying mechanism that generates the response variables from the covariates does not change: the conditional distribution of the response given the covariates remains the same across all domains, while the marginal distribution of the covariates may change. These assumptions are also present in covariate shift problems [5]. Transfer learning, another approach to this problem, deals with the case in which the conditional distribution of the response given the covariates shifts between domains while the covariate distribution stays the same [6].

We assume we have independent, identically distributed (i.i.d.) data x_1, ..., x_n, with x_i ~ x ∈ X ⊆ R^p and corresponding response variables y_1, ..., y_n, for training a model (called the source data), and z_1, ..., z_m, with z_j ~ z ∈ Z ⊆ R^q and responses ξ_1, ..., ξ_m, for testing (called the target data). Our main assumption is that the unknown response variables ξ_1, ..., ξ_m come from the same model that generated the known response variables (i.e., [Y | x] and [Ξ | z] follow the same law, though the marginal distributions of x and z differ in general). Our goal is to learn a parameterized conditional model optimal under (Ξ, z) while only knowing a small number of observations from Ξ (or knowing none in the unsupervised case).
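As a minimal synthetic illustration of this assumption (not taken from the paper), the sketch below draws source and target covariates from different marginal distributions while keeping the conditional mechanism fixed; all sizes, distributions, and names are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, m, p = 200, 300, 5

# Source covariates x ~ f and target covariates z ~ g: different marginals.
X = rng.normal(loc=0.0, scale=1.0, size=(n, p))
Z = rng.normal(loc=1.0, scale=2.0, size=(m, p))

# Shared conditional mechanism [Y | x] = [Xi | z]: same coefficients, same noise law.
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=0.5, size=n)      # observed source responses
xi = Z @ beta + rng.normal(scale=0.5, size=m)     # target responses, unknown at training time

# A model fit only to (X, y) is evaluated where it will actually be used: on (Z, xi).
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print("source MSE:", np.mean((X @ beta_hat - y) ** 2))
print("target MSE:", np.mean((Z @ beta_hat - xi) ** 2))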
Our data will typically consist of X = [x_1, ..., x_n]^T ∈ R^{n×p}, y = [y_1, ..., y_n]^T ∈ R^n, and Z = [z_1, ..., z_m]^T ∈ R^{m×q}. On occasion we will be blessed with some labeled examples from the target data, though all methods will be outlined for the unsupervised case, with the semisupervised case being a straightforward extension.

II. INSTANCE-WEIGHTING METHODS

Instance-weighting methods seek a set of weights to apply to the source data that will transform the distribution of the labeled source data into that of the unlabeled (or partially labeled) target data. In this way, a model transported from the labeled data in a different domain will yield better results. Shimodaira [5] proposed a method for correcting this discrepancy, with a view toward improving predictive performance, by weighting each element in the source data by an importance weight based on both the source and target density functions.

A. Covariate Shift

In covariate shift (CS, [5]), we assume the source data x ~ f and the target data z ~ g for some density functions f and g. If we operate under the previous assumption (namely, that [Y | x] and [Ξ | z] agree while the marginals of x and z differ), we fix a parameterization for the conditional distribution and define the Kullback-Leibler loss for the source data as

L_x(β) = ∫_{(Y,X)} f(x) q(y | x) log p(y | x, β) dy dx

and for the target data as

L_z(β) = ∫_{(Ξ,Z)} g(z) q(ξ | z) log p(ξ | z, β) dξ dz.

We shall assume for the time being that (Y, x) and (Ξ, z) have the same support, so that the loss functions L_x and L_z differ only through the marginal distributions f and g. Let

L_w^(n)(β; X, y) = Σ_{i=1}^n w(x_i) log p(y_i | x_i, β),

so that, for w(x) ≡ 1, L_w^(n)/n → L_x as n → ∞. Since we desire the loss for the target data z, we take w to be the importance weight w(x) = g(x)/f(x). In this case, L_w^(n)/n → L_z as n → ∞, which is what is needed.

This method is not always optimal. First, it typically will only yield improvements under a misspecification of the model (i.e., when p(y | x, β) differs from the true model), though for high-dimensional data even standard methods for misspecified models perform similarly to the importance-weighted method [7]. Additionally, density estimation is a nontrivial problem in high dimensions, which is often the case in problems of interest. Density estimates can still be made by, for example, using radial kernels (see Section IV), though accurate estimates of the density remain difficult in this setting. Even for one-dimensional problems, CS requires that supp(Z) ⊆ supp(X), a restriction that can hinder the types of problems we will consider.
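A minimal NumPy sketch of the weighting scheme just described is given below. It estimates f and g with Gaussian kernel density estimates using the rule-of-thumb bandwidths of Section IV, forms w(x_i) = ĝ(x_i)/f̂(x_i), and fits a weighted least-squares model as the Gaussian-likelihood special case of the weighted log-likelihood; all sample sizes, coefficients, and function names are illustrative rather than those used in the experiments.

import numpy as np

def gaussian_kde_logpdf(points, data, bandwidths):
    """Log of a product-Gaussian kernel density estimate with a diagonal bandwidth matrix."""
    # points: (k, p), data: (n, p), bandwidths: (p,)
    diff = (points[:, None, :] - data[None, :, :]) / bandwidths   # (k, n, p)
    log_kernels = -0.5 * np.sum(diff ** 2, axis=-1)               # (k, n)
    log_norm = np.sum(np.log(bandwidths)) + 0.5 * points.shape[1] * np.log(2 * np.pi)
    return np.logaddexp.reduce(log_kernels, axis=1) - np.log(data.shape[0]) - log_norm

def rule_of_thumb(data):
    """Per-coordinate bandwidth h_j = n^{-1/(p+4)} * sigma_j (the rule of thumb of Section IV)."""
    n, p = data.shape
    return n ** (-1.0 / (p + 4)) * data.std(axis=0, ddof=1)

rng = np.random.default_rng(0)
n, m, p = 200, 300, 3
X = rng.normal(0.0, 1.0, size=(n, p))
Z = rng.normal(0.5, 1.2, size=(m, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Importance weights w(x_i) = g_hat(x_i) / f_hat(x_i), evaluated at the source points.
log_f = gaussian_kde_logpdf(X, X, rule_of_thumb(X))
log_g = gaussian_kde_logpdf(X, Z, rule_of_thumb(Z))
w = np.exp(log_g - log_f)

# Weighted least squares: minimize sum_i w(x_i) (y_i - x_i^T beta)^2, the empirical
# counterpart of the weighted log-likelihood loss above under a Gaussian model.
sw = np.sqrt(w)
beta_w = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
print(beta_w)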

Fig. 1. Sample images from FG-NET: originals (top), landmark points (middle), and rotated/reduced landmark points (bottom).

B. Kernel Mean Matching

Kernel mean matching (KMM, [8]) seeks to generalize CS by incorporating the weights w into an objective function and optimizing over the data to find optimal weights instead of taking them as known. CS is further generalized by defining a kernel map Φ : X → F and assuming still that X = Z. An expectation operator µ is defined as µ(F) = E_F[Φ(x)], where F is the cdf corresponding to the density f (and similarly µ(G) for the target). The KMM procedure seeks to solve

minimize_w ‖µ(G) − E_F[w(x) Φ(x)]‖ subject to w(x) ≥ 0 and E_F[w(x)] = 1.

To find w ∈ R^n, we define an empirical version of the above objective function and incorporate the constraints w_i ∈ [0, W] and |Σ_{i=1}^n w_i − n| ≤ nε, analogous to the two constraints above. This empirical objective is written as

J(w) = (1/2) w^T K w − k^T w, where K_ij = k(x_i, x_j), k_i = (n/m) Σ_{j=1}^m k(x_i, z_j),

for an appropriate kernel function k(·, ·). The constrained minimization of J above under the empirical constraints can be solved by any quadratic programming method (e.g., an interior point method [9]). In [8], k is taken to be the Gaussian kernel k(u, v) = exp{−σ ‖u − v‖²} for a fixed σ. The choice of ε is governed by the following result: Huang et al. [8] showed that if w(x) ∈ [0, W] for all x ∈ X, then, given x_1, ..., x_n i.i.d. from F,

(1/n) Σ_{i=1}^n w(x_i) →_D N(µ_w, σ_w²),

where µ_w = ∫_X w(x) dF(x) and σ_w ≤ W/(2√n). This indicates that we should take ε = O(W/√n) for some fixed constant W.

KMM is able to overcome some of the drawbacks of CS. Since weights are estimated directly, density functions no longer need to be defined explicitly (or estimated), reducing potential errors, especially in multivariate problems. Moreover, the kernel approach allows for capturing possible nonlinearities that might improve predictive performance. Unfortunately, we still have the restriction that the source and target must have the same support, and in fact we still need supp(Z) ⊆ supp(X) as in CS. Additionally, since a kernel approach is used, weights are only defined at the training input points, so that if we desire weights for points that were not available initially (e.g., for cross-validation), we must rerun the entire optimization.
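The sketch below sets up the empirical KMM program, minimizing (1/2) w^T K w − k^T w subject to 0 ≤ w_i ≤ W and |Σ_i w_i − n| ≤ nε, and solves it with SciPy's SLSQP routine as a generic stand-in for the interior-point quadratic programming solver mentioned above; the kernel width σ, W, and ε below are illustrative values, not a recommendation.

import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sigma * sq)

rng = np.random.default_rng(0)
n, m, p = 80, 120, 3
X = rng.normal(0.0, 1.0, size=(n, p))
Z = rng.normal(0.5, 1.2, size=(m, p))

sigma, W = 0.5, 1000.0
eps = 1.0 / np.sqrt(n)                    # epsilon on the order of n^{-1/2}, as in Section IV
K = gaussian_kernel(X, X, sigma)          # K_ij = k(x_i, x_j)
kappa = (n / m) * gaussian_kernel(X, Z, sigma).sum(axis=1)   # k_i = (n/m) sum_j k(x_i, z_j)

def objective(w):
    return 0.5 * w @ K @ w - kappa @ w

def grad(w):
    return K @ w - kappa

# |sum_i w_i - n| <= n*eps, written as two smooth linear inequalities.
constraints = [
    {"type": "ineq", "fun": lambda w: n * eps - (w.sum() - n)},
    {"type": "ineq", "fun": lambda w: n * eps + (w.sum() - n)},
]
bounds = [(0.0, W)] * n
res = minimize(objective, np.ones(n), jac=grad, bounds=bounds,
               constraints=constraints, method="SLSQP")
w_hat = res.x
print(w_hat[:5], w_hat.mean())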

C. Kullback-Leibler Importance Estimation

The Kullback-Leibler importance estimation procedure (KLIEP, [7]) attempts to improve on KMM by specifying a model for the weight function w(x), so that weights can be obtained for points not available at training time, and estimating it from the given data. We model

ĝ(x) = ŵ(x) f(x), ŵ(x) = Σ_{l=1}^b α_l φ_l(x),

where {α_l} are unknown parameters to be estimated and φ_l ≥ 0 are fixed basis functions. The weight function ŵ is chosen so as to minimize

D_KL[g(x) ‖ ĝ(x)] = ∫_X g(x) log (g(x)/f(x)) dx − ∫_X g(x) log ŵ(x) dx.

Since the second term is the only one involving the parameters of interest, we form the objective

J(α) = (1/m) Σ_{j=1}^m log ( Σ_{l=1}^b α_l φ_l(z_j) ).

J is concave in α, but it is unbounded above without constraints, so constraints are required to successfully optimize over α. First, we desire ŵ(x) ≥ 0, which we enforce as α_l ≥ 0 for l = 1, ..., b. Additionally, since ĝ(x) = ŵ(x) f(x) should be a proper density function, we have

1 = ∫_X ŵ(x) f(x) dx ≈ (1/n) Σ_{i=1}^n Σ_{l=1}^b α_l φ_l(x_i).

In practice, we define the basis functions using the target data. Here,

ŵ(x) = Σ_{j=1}^m α_j k(x, z_j),

where k(·, ·) is the Gaussian kernel as before. The estimation of α is carried out through gradient ascent on J above. We define K_ij = k(z_i, z_j) and k_j = (1/n) Σ_{i=1}^n k(x_i, z_j), as in KMM except with the target data in place of the source data. The parameter update then uses gradient ascent with K and constraint satisfaction using k to ensure the α are properly normalized. It can be shown that, when the Gaussian kernel is used at the test input points to estimate the weight function, KLIEP converges to the optimal solution at a rate slightly slower than O_p(n^{−1/2}), assuming n = m.

KLIEP improves over CS and KMM in obtaining a general weight function that can be applied to points that are not available at training time. The method can be computationally intensive, especially when a large number of target samples are available; in this case, subsampling or clustering can be used to reduce the computational cost of estimating the basis functions. KLIEP has similar issues to CS and KMM in that we still require X = Z, and we further require F and G to be mutually absolutely continuous.
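A minimal sketch of this procedure, with Gaussian basis functions centred at the target points and a projected gradient ascent that alternates gradient steps with the nonnegativity and normalization constraints, is given below; the kernel width, step size, and iteration count are arbitrary choices rather than tuned values.

import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, m, p = 200, 150, 3
X = rng.normal(0.0, 1.0, size=(n, p))
Z = rng.normal(0.5, 1.2, size=(m, p))

sigma = 1.0
Phi_Z = gaussian_kernel(Z, Z, sigma)                  # phi_l(z_j) with basis centred at target points
phi_bar = gaussian_kernel(X, Z, sigma).mean(axis=0)   # (1/n) sum_i phi_l(x_i), used for normalization

alpha = np.ones(m) / m
step = 1e-3
for _ in range(2000):
    # Gradient of J(alpha) = (1/m) sum_j log( sum_l alpha_l phi_l(z_j) )
    denom = Phi_Z @ alpha
    grad = Phi_Z.T @ (1.0 / np.maximum(denom, 1e-12)) / m
    alpha = alpha + step * grad
    # Constraint satisfaction: alpha >= 0 and (1/n) sum_i w_hat(x_i) = 1.
    alpha = np.maximum(alpha, 0.0)
    alpha = alpha / max(phi_bar @ alpha, 1e-12)

def w_hat(points):
    """Estimated weight function, defined for arbitrary points (unlike CS and KMM)."""
    return gaussian_kernel(points, Z, sigma) @ alpha

print(w_hat(X)[:5])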
III. DIMENSION REDUCTION METHODS

The importance-weighting methods share similar drawbacks, namely that they have difficulty handling cases in which the source and target data not only come from different distributions but perhaps even have differing underlying structures, so that X ≠ Z or supp(Z) is not contained in supp(X). In this case, transformations of the source or target data (or, typically, both) are desired in order to obtain a representation that is, ideally, invariant to domain changes.

Many dimension reduction methods require knowledge of the structure of the Grassmannian G(p, d). We think of this space intuitively as the set of all d-dimensional subspaces of R^p, i.e., the quotient space

G(p, d) = R(p, d)/∼,

where R(p, d) is the space of all p × d matrices of rank d and, for U, V ∈ R^{p×d}, U ∼ V if there exists a nonsingular L ∈ R^{d×d} such that V = U L [10]. Elements of the Grassmannian are often identified by bases for their particular subspaces, e.g., by η ∈ R^{p×d} with η^T η = I_d.

A. Intermediate Subspaces

The intermediate subspace approach (IS, [11]) seeks a latent feature representation by obtaining intermediate feature spaces that help to quantify the shift from the source space to the target space. In this method, the latent variables are obtained by sampling points along a geodesic on the Grassmannian between the subspace spanned by the source dataset and the subspace spanned by the target dataset. It can be written as follows: using principal component analysis, learn d-dimensional (d < p = q) representations of the data X and Z as X̃ = X η_x and Z̃ = Z η_z for η_x, η_z ∈ G(p, d); then, using the geodesic along G(p, d) between η_x and η_z, obtain intermediate transformations η_1, ..., η_K to use in finding representations X̃, X̃_1, ..., X̃_{K+1} and Z̃, Ẑ_K, ..., Ẑ_0, where X̃_{K+1} = X η_z and Ẑ_0 = Z η_x. The newly acquired latent feature representation for the source data X is then the concatenation of these matrices,

X* = [ X̃  X̃_1  ···  X̃_{K+1} ],

with a similar representation acquired for the data Z. Partial least squares [12], hereafter called PLS, is then performed to obtain a low-dimensional model operating on these expanded datasets. Some drawbacks of the IS method are its reliance on a large number of tuning parameters and the high dimensionality that must be overcome when many subspaces are desired.
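A sketch of the subspace-sampling step is given below, assuming the geodesic is parameterized through the principal angles between the two PCA subspaces; the PLS step and the choice of K are omitted, and the function names, sizes, and the particular geodesic construction are my own choices rather than the implementation of [11].

import numpy as np

def pca_basis(A, d):
    """Orthonormal basis of the top-d principal subspace (a point on G(p, d))."""
    A = A - A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:d].T                                   # (p, d), columns orthonormal

def grassmann_geodesic(eta_x, eta_z, t):
    """Basis of the subspace at position t in [0, 1] on the geodesic from span(eta_x) to span(eta_z)."""
    U, s, Vt = np.linalg.svd(eta_x.T @ eta_z)
    theta = np.arccos(np.clip(s, -1.0, 1.0))          # principal angles between the subspaces
    # Orthonormal directions, orthogonal to span(eta_x), along which the subspace rotates.
    W = eta_z @ Vt.T - eta_x @ U @ np.diag(s)
    sin_t = np.where(np.sin(theta) > 1e-12, np.sin(theta), 1.0)
    Q = W / sin_t
    return eta_x @ U @ np.diag(np.cos(t * theta)) + Q @ np.diag(np.sin(t * theta))

rng = np.random.default_rng(0)
n, m, p, d, K = 200, 150, 10, 3, 8
X = rng.normal(size=(n, p))
Z = rng.normal(0.5, 1.5, size=(m, p))

eta_x, eta_z = pca_basis(X, d), pca_basis(Z, d)
ts = np.linspace(0.0, 1.0, K + 2)                     # endpoints give X @ eta_x and X @ eta_z
X_star = np.hstack([X @ grassmann_geodesic(eta_x, eta_z, t) for t in ts])
Z_star = np.hstack([Z @ grassmann_geodesic(eta_x, eta_z, t) for t in ts])
print(X_star.shape, Z_star.shape)                     # expanded (K+2)*d-dimensional representations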

B. Geodesic Flow Kernel

The geodesic flow kernel (GFK, [13]) seeks improvements over the IS method. GFK removes the need for sampling along the geodesic between the source and target subspaces and uses a kernel approach to mitigate the extreme dimensionality of IS when a large number of subspaces is used. As before, we have η_x, η_z ∈ R^{p×d} as bases for the source and target subspaces, respectively, and we write their orthogonal complements as η_x^⊥, η_z^⊥ ∈ R^{p×(p−d)}. We write the geodesic flow γ : [0, 1] → G(p, d) as

γ(t) = η_x U_1 Γ(t) − η_x^⊥ U_2 Σ(t),

where U_1, U_2, Γ, Σ are given by the generalized singular value decomposition

η_x^T η_z = U_1 Γ V^T, (η_x^⊥)^T η_z = −U_2 Σ V^T.

We define Γ(t) and Σ(t) as diagonal matrices with cos(tθ_i) and sin(tθ_i) on the diagonal, where Γ and Σ are diagonal matrices with cos(θ_i) and sin(θ_i) on the diagonal for i = 1, ..., d. Our goal is to use all t ∈ (0, 1) to obtain representations γ(t)^T x for x in the source domain. Computationally this is infeasible, so we proceed through a kernel approach, where

⟨Φ(u), Φ(v)⟩ = ∫_0^1 (γ(t)^T u)^T (γ(t)^T v) dt = u^T G v,

with G ∈ R^{p×p} positive semidefinite and defined through matrices obtained from the generalized singular value decomposition above. This kernel is then used to perform prediction through kernel nearest neighbor. A potential drawback of both IS and GFK is that the geodesic path between two subspaces may not be the most informative path, especially when further labeling information is available. Moreover, all of the issues with kernel methods outlined previously hold for GFK as well.
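Rather than the closed-form expression for G obtained from the generalized singular value decomposition in [13], the sketch below approximates G = ∫_0^1 γ(t) γ(t)^T dt numerically on a grid of t values; it repeats the geodesic construction from the previous sketch so that it is self-contained, and all names and sizes are again hypothetical.

import numpy as np

def pca_basis(A, d):
    A = A - A.mean(axis=0)
    return np.linalg.svd(A, full_matrices=False)[2][:d].T

def grassmann_geodesic(eta_x, eta_z, t):
    U, s, Vt = np.linalg.svd(eta_x.T @ eta_z)
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    W = eta_z @ Vt.T - eta_x @ U @ np.diag(s)
    Q = W / np.where(np.sin(theta) > 1e-12, np.sin(theta), 1.0)
    return eta_x @ U @ np.diag(np.cos(t * theta)) + Q @ np.diag(np.sin(t * theta))

def gfk_matrix(eta_x, eta_z, n_grid=200):
    """Approximate G = int_0^1 gamma(t) gamma(t)^T dt by averaging over a grid of t values."""
    p = eta_x.shape[0]
    G = np.zeros((p, p))
    for t in np.linspace(0.0, 1.0, n_grid):
        gamma_t = grassmann_geodesic(eta_x, eta_z, t)
        G += gamma_t @ gamma_t.T
    return G / n_grid

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z = rng.normal(0.5, 1.5, size=(150, 10))
eta_x, eta_z = pca_basis(X, 3), pca_basis(Z, 3)

G = gfk_matrix(eta_x, eta_z)
# The induced kernel is <Phi(u), Phi(v)> = u^T G v; a kernel nearest-neighbour rule can use
# the squared distance (u - v)^T G (u - v).
u, v = X[0], Z[0]
print(u @ G @ v, (u - v) @ G @ (u - v))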
C. Regularized Likelihood Directions

Regularized likelihood directions (RLD, [14]) approaches the domain adaptation problem through sufficient dimension reduction [15]: we seek a transformation in which η is estimated so that P(Y ∈ A | η^T x) = P(Y ∈ A | x) for any set A. Many of the previous dimension reduction approaches can be reinterpreted as enforcing normality constraints on the source and target data [16]. Here, we make the slightly more relaxed assumption that Y takes values in {1, ..., C} and that x | y ~ N(µ_{x|y}, Σ_{x|y}), y = 1, ..., C, where the n_y observations x | y come from X with response variable y. We write cov(x) = Σ_x. Likelihood-acquired directions (LAD, [17]) seeks to maximize

L(η) = (1/2) log |η^T Σ_x η| − (1/2) Σ_{y=1}^C (n_y/n) log |η^T Σ_{x|y} η|

over all η such that η^T η = I_d, where L(η) is proportional to a likelihood function. This is done through conjugate gradient ascent on G(p, d), as described in [18]. RLD proceeds in a similar fashion to KMM in that it attempts to keep the means of the source and target distributions close, this time within class. To do this, we define

Γ_λ(η, γ) = −(λ/2) Σ_{y=1}^C 1{l_y ≥ 1} ‖η^T µ_{x|y} − γ^T µ_{z|y}‖²,

where l_y denotes the number of labeled target observations available in class y, for a fixed λ > 0, as a regularization term to be included in the overall maximum likelihood framework. Here we let γ ∈ R^{q×d}, as we do not necessarily have p = q. However, if p = q, letting γ = η often results in improvements, because we estimate a smaller number of parameters from the same data. Since we often have only a small number of realizations from z, second moments are not constrained. Our goal then becomes to solve

arg max_{(η, γ)} L(η) + Γ_λ(η, γ).

While this method is defined as requiring within-class means from the target distribution, we can adapt it to match only global means in the unsupervised case.
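The sketch below evaluates L(η) and Γ_λ(η, γ) and maximizes their sum with a crude numerical-gradient ascent followed by QR re-orthonormalization; this is only a stand-in for the Riemannian conjugate gradient (sgmin, [22]) used in the paper, and all sizes, step sizes, and function names are illustrative.

import numpy as np

def lad_likelihood(eta, X, y):
    """L(eta) = 1/2 log|eta' Sigma_x eta| - 1/2 sum_y (n_y/n) log|eta' Sigma_{x|y} eta|."""
    n = X.shape[0]
    Sigma = np.cov(X, rowvar=False)
    val = 0.5 * np.linalg.slogdet(eta.T @ Sigma @ eta)[1]
    for c in np.unique(y):
        Xc = X[y == c]
        Sigma_c = np.cov(Xc, rowvar=False)
        val -= 0.5 * (len(Xc) / n) * np.linalg.slogdet(eta.T @ Sigma_c @ eta)[1]
    return val

def mean_penalty(eta, gamma, X, y, Z, yz, lam):
    """Gamma_lambda: within-class mean matching over classes with labelled target data."""
    val = 0.0
    for c in np.unique(yz):
        mu_x = X[y == c].mean(axis=0)
        mu_z = Z[yz == c].mean(axis=0)
        val -= 0.5 * lam * np.sum((eta.T @ mu_x - gamma.T @ mu_z) ** 2)
    return val

def objective(eta, gamma, X, y, Z, yz, lam):
    return lad_likelihood(eta, X, y) + mean_penalty(eta, gamma, X, y, Z, yz, lam)

def fit_rld(X, y, Z, yz, d, lam=1.0, steps=200, lr=1e-2, h=1e-5):
    rng = np.random.default_rng(0)
    p, q = X.shape[1], Z.shape[1]
    eta = np.linalg.qr(rng.normal(size=(p, d)))[0]
    gamma = np.linalg.qr(rng.normal(size=(q, d)))[0]
    for _ in range(steps):
        for M, shape in ((eta, (p, d)), (gamma, (q, d))):
            grad = np.zeros(shape)
            base = objective(eta, gamma, X, y, Z, yz, lam)
            for idx in np.ndindex(shape):
                M[idx] += h
                grad[idx] = (objective(eta, gamma, X, y, Z, yz, lam) - base) / h
                M[idx] -= h
            M += lr * grad                       # ascent step on the current matrix, in place
        eta = np.linalg.qr(eta)[0]               # retract back to orthonormal bases
        gamma = np.linalg.qr(gamma)[0]
    return eta, gamma

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6)); y = rng.integers(0, 3, 150)
Z = rng.normal(1.0, 1.5, size=(60, 4)); yz = rng.integers(0, 3, 60)
eta, gamma = fit_rld(X, y, Z, yz, d=2, lam=1.0, steps=50)
print(eta.shape, gamma.shape)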

IV. EXPERIMENTS

We test the outlined methods against one another (as well as against a baseline method) using synthetic and real datasets. The real data considered is the age estimation problem shown in Figure 1, while the synthetic dataset is outlined below. The baseline method is simply a least squares classifier using PCA with no adaptation. Both unsupervised and semisupervised problems are considered. All methods were outlined for the unsupervised case; in the semisupervised case, the data used for estimating the model is augmented with the given labeled target data, which for all methods is also included as the target data used to estimate the weights or transformation.

For CS, we learn the density functions using a radial kernel density estimation procedure with a Gaussian kernel, i.e.,

f̂_n(x) = (1/(n|H|)) Σ_{i=1}^n K[H^{−1}(x − x_i)], K(u) ∝ exp(−½ ‖u‖²),

where H is a diagonal bandwidth matrix with each diagonal element chosen according to the rule of thumb ĥ_j = n^{−1/(p+4)} σ̂_j, with σ̂_j the estimated standard deviation of column j of X [19]. A similar density estimate is obtained for g. We choose a Gaussian kernel over a more efficient kernel (e.g., Epanechnikov) due to its infinite support. For KMM, we take ε = 1/n^{1/2} and W = 1000 as in [8], but set σ = 1e−4, as it yielded better results in practice than the authors' suggestion of σ = 1e−1. For KLIEP, we take ε = 1e−3 and σ = 1e−1. For IS, we set the number of intermediate subspaces to 8, and for RLD we set λ = 1 for the simulation and λ = 0.2 for the real-world data. These tuning parameters were found to give the best results. More intelligent parameter selection can be done, e.g., as in [20] for regularization parameters or [21] for bandwidths. Other types of cross-validation can be performed for all methods, though for many of them this entails a large amount of computational expense. For RLD, we use the sgmin implementation provided by Lippert and Edelman [22]. To easily incorporate weights and to reduce computational complexity, a one-vs-all least squares classifier is estimated for all methods except GFK, which uses kernel nearest neighbor. For the baseline and instance-weighting methods, PCA is first performed to reduce the effect of the high dimensionality. For the case where p ≠ q, features from the higher-dimensional space are transformed into the lower-dimensional space using PCA for all methods except RLD.

A. Simulation Studies

For the simulation studies, we generate 200 observations in R^6 for the source data and 300 observations in R^4 for the target data, in three classes. We generate the source data as multivariate normal with zero mean and covariance matrix

Σ_ij = 0.5^{|i−j|} for i, j = 1, ..., 6. (1)

The target data is generated as an equal-weight mixture of two normals, one with a mean vector of ones and covariance matrix

Σ_ij = 0.5^{|i−j|} for i, j = 1, ..., 4, (2)

the other with a mean vector of negative ones and a covariance matrix of the same form,

Σ_ij = ρ^{|i−j|} for i, j = 1, ..., 4. (3)

We generate both η and γ with random normal entries in R^{6×2} and R^{4×2}, respectively, and take their orthogonalizations. The labels are generated as

y = η_1^T x − η_2^T x + ε, (4)

where ε ~ N(0, 0.5²). The response values y are then discretized into three categories by thresholding them at their one-third and two-thirds quantiles. Similar labels are generated for the target data using γ.
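A sketch of this data-generation recipe follows. The correlation parameter ρ of the second mixture component and the exact combination of the two projections in (4) did not survive the transcription, so the values used below (rho2 = 0.9 and a difference of the two projected scores) are placeholders and should not be read as the settings used in the paper.

import numpy as np

def ar1_cov(dim, rho):
    idx = np.arange(dim)
    return rho ** np.abs(idx[:, None] - idx[None, :])    # Sigma_ij = rho^{|i-j|}

def discretize(scores):
    """Threshold continuous responses at their 1/3 and 2/3 quantiles to get three classes."""
    q1, q2 = np.quantile(scores, [1 / 3, 2 / 3])
    return np.digitize(scores, [q1, q2])

rng = np.random.default_rng(0)
n, m, p, q, d = 200, 300, 6, 4, 2

# Source: N(0, Sigma) with Sigma_ij = 0.5^{|i-j|}.
X = rng.multivariate_normal(np.zeros(p), ar1_cov(p, 0.5), size=n)

# Target: equal-weight mixture of N(+1, ar1_cov(4, 0.5)) and N(-1, ar1_cov(4, rho2));
# rho2 is a placeholder value, not the one used in the paper.
rho2 = 0.9
comp = rng.integers(0, 2, size=m)
Z = np.where(comp[:, None] == 0,
             rng.multivariate_normal(np.ones(q), ar1_cov(q, 0.5), size=m),
             rng.multivariate_normal(-np.ones(q), ar1_cov(q, rho2), size=m))

# Random orthonormal directions eta (6 x 2) and gamma (4 x 2).
eta = np.linalg.qr(rng.normal(size=(p, d)))[0]
gamma = np.linalg.qr(rng.normal(size=(q, d)))[0]

# Continuous responses from a linear model in the projected covariates (eq. (4), as
# reconstructed here), then discretized into three classes.
y = discretize(X @ eta[:, 0] - X @ eta[:, 1] + rng.normal(scale=0.5, size=n))
y_z = discretize(Z @ gamma[:, 0] - Z @ gamma[:, 1] + rng.normal(scale=0.5, size=m))
print(np.bincount(y), np.bincount(y_z))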
Each study is run ten times, and the average misclassification rate is recorded for all methods under consideration, along with the standard deviation. For the unsupervised case, we use half of the source data and half of the target data to build the model, while testing the model on the remaining half of both the source and target data. For the semisupervised studies, we build the model again on half of the source data, this time selecting three observations per class from the target data to use in parameter estimation. We then test this model on the remaining half of the source data and all 300 of the target data points.

Misclassification rates are given in Table I with standard deviations given in parentheses. No method is uniformly best on both the source and the target. RLD looks to be the clear winner in all cases except semisupervised same-dimension on the source data and unsupervised differing-dimension on the target data. In the first of these exceptions all methods seem to perform equally well, with KLIEP performing best; in the second, again all methods yield similar results, this time with KMM coming out ahead. The good performance of RLD is partly due to the fact that the underlying model was linear in its arguments; nonlinear models require more complicated techniques [14]. Note that most methods saw performance drops when moving from source to target, with most of the top methods performing better than the baseline on the target data, an indication that accounting for the domain shift is a worthwhile approach.

TABLE I
MISCLASSIFICATION RATES FROM SIMULATION STUDIES. STANDARD DEVIATIONS ARE GIVEN IN PARENTHESES.

                      Unsup/Source   Unsup/Target   Semisup/Source   Semisup/Target
Same dimension
  Baseline            (7.48)         (21.47)        (9.02)           (30.14)
  CS                  (6.65)         (21.83)        (11.58)          (28.56)
  KMM                 (6.62)         (22.04)        (14.15)          (25.65)
  KLIEP               (7.55)         (21.43)        (8.97)           (30.25)
  IS                  (8.19)         (18.24)        (9.73)           (30.69)
  GFK                 (9.96)         (14.60)        (12.58)          (23.55)
  RLD                 (3.88)         (17.61)        (10.72)          (14.91)
Different dimension
  Baseline            (8.52)         (29.26)        (6.15)           (20.85)
  CS                  (8.76)         (29.78)        (9.01)           (20.03)
  KMM                 (8.71)         (28.85)        (8.21)           (20.35)
  KLIEP               (8.24)         (29.34)        (6.21)           (20.94)
  IS                  (10.15)        (27.42)        (8.06)           (12.51)
  GFK                 (13.00)        (16.31)        (8.74)           (8.20)
  RLD                 (4.02)         (28.28)        (4.38)           (13.09)
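A sketch of one unsupervised repetition of this protocol, in the same-dimension setting for simplicity, is given below: PCA features, a (possibly weighted) one-vs-all least squares classifier, and the misclassification rates on the held-out source and target halves. The estimate_weights argument is a placeholder for any of the instance-weighting estimators; the semisupervised augmentation and the p ≠ q handling are omitted.

import numpy as np

def pca_transform(X_ref, d):
    mu = X_ref.mean(axis=0)
    V = np.linalg.svd(X_ref - mu, full_matrices=False)[2][:d].T
    return lambda A: (A - mu) @ V

def one_vs_all_lsq(F, y, n_classes, w=None):
    """Weighted one-vs-all least squares: one linear score per class, predict the argmax."""
    w = np.ones(len(y)) if w is None else w
    sw = np.sqrt(w)[:, None]
    F1 = np.hstack([F, np.ones((len(F), 1))])                 # add an intercept column
    Y = np.eye(n_classes)[y] * 2 - 1                          # +1 / -1 targets per class
    B = np.linalg.lstsq(F1 * sw, Y * sw, rcond=None)[0]
    return lambda A: np.argmax(np.hstack([A, np.ones((len(A), 1))]) @ B, axis=1)

def misclassification_rate(predict, F, y):
    return np.mean(predict(F) != y)

def run_once(X, y, Z, y_z, d, estimate_weights=None, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    i, j = rng.permutation(len(X)), rng.permutation(len(Z))
    tr_s, te_s = i[: len(X) // 2], i[len(X) // 2 :]
    tr_t, te_t = j[: len(Z) // 2], j[len(Z) // 2 :]
    to_feat = pca_transform(X[tr_s], d)
    w = estimate_weights(X[tr_s], Z[tr_t]) if estimate_weights else None
    predict = one_vs_all_lsq(to_feat(X[tr_s]), y[tr_s], n_classes=3, w=w)
    return (misclassification_rate(predict, to_feat(X[te_s]), y[te_s]),
            misclassification_rate(predict, to_feat(Z[te_t]), y_z[te_t]))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6)); y = rng.integers(0, 3, 200)
Z = rng.normal(0.5, 1.3, size=(300, 6)); y_z = rng.integers(0, 3, 300)
errs = [run_once(X, y, Z, y_z, d=2, rng=np.random.default_rng(k)) for k in range(10)]
print(np.mean(errs, axis=0), np.std(errs, axis=0))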

B. Real-World Data

We consider an age estimation problem from the data at [1], described in the Introduction (see Figure 1). For all methods we first apply the inverse exponential map. This map sends the predictors under consideration to the tangent space about the manifold on which these predictors lie, which is an ordinary Euclidean space; see [23] for numerical methods to compute this map. All results are given for the semisupervised case. The source data are the original landmark points, while the target data consist of source landmark points rotated by a randomly sampled angle between 0 and π/4. We sample with replacement 250 observations per class from the source data and 50 observations per class from the target data to build the model, while testing on the remaining source and target data. Here, classes are obtained by thresholding the continuous response variables at their median (obtained from the labeled source data). This categorical information is then used to estimate transformations in the RLD framework. The square root of the age variable is used, as this guarantees nonzero estimates and has been shown to improve performance [16]. As in the simulations, the sampling is done ten times, with the average mean absolute error recorded, along with the standard deviation.

Results are given in Table II, where the estimated dimension is taken to be both 8 and 16.

TABLE II
MEAN ABSOLUTE ERRORS FROM AGE ESTIMATION. STANDARD DEVIATIONS ARE GIVEN IN PARENTHESES.

                      d = 8/Source   d = 8/Target   d = 16/Source   d = 16/Target
Same dimension
  Baseline            7.35 (0.18)         (1.28)    7.06 (0.31)          (1.41)
  CS                       (0.81)         (0.71)         (0.51)          (0.45)
  KMM                 7.35 (0.18)         (1.28)    7.06 (0.31)          (1.41)
  KLIEP               7.33 (0.18)         (1.28)    7.04 (0.31)          (1.41)
  IS                  7.35 (0.24)    7.55 (0.25)    6.86 (0.33)     6.88 (0.20)
  GFK                      (0.21)    7.72 (0.91)         (0.29)     7.07 (0.52)
  RLD                 6.91 (0.20)    7.38 (0.44)    6.78 (0.23)     7.07 (0.22)
Different dimension
  Baseline            7.24 (0.22)         (1.31)    6.95 (0.37)     9.72 (1.68)
  CS                       (0.64)         (0.50)         (0.67)          (0.54)
  KMM                 7.24 (0.22)         (1.31)    6.95 (0.37)     9.72 (1.68)
  KLIEP               7.23 (0.22)         (1.30)    6.94 (0.36)     9.69 (1.67)
  IS                  7.15 (0.17)         (0.77)    6.91 (0.27)     9.57 (0.64)
  GFK                      (0.45)         (1.63)         (0.30)          (1.57)
  RLD                 7.10 (0.19)    9.44 (1.06)    7.05 (0.43)     8.45 (0.70)

In the same-dimension case, the dimension reduction methods come out ahead, being the only methods to obtain significantly better results than the baseline method. A similar scenario occurs in the different-dimension case, in which we remove half of the landmark points at random; however, here the only method to perform significantly better than the baseline is RLD. GFK performs much worse in this case, mainly because age estimates are obtained using a nearest neighbor method as opposed to a regression model. In all cases, CS performs poorly. Improvements to this method might be possible if more advanced, manifold-valued density estimates were considered on the original data as opposed to the transformed data that was used.
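A sketch of the target-domain construction and response preprocessing described above is given below, with random arrays standing in for the FG-NET landmarks and ages; the inverse exponential map of [23], the per-class resampling, and the classifiers themselves are omitted, and the error function shown is only one plausible way of scoring predictions made on the square-root scale.

import numpy as np

def rotate_landmarks(P, angle):
    """Apply a 2-D rotation to an (L, 2) array of landmark points."""
    c, s = np.cos(angle), np.sin(angle)
    return P @ np.array([[c, -s], [s, c]]).T

rng = np.random.default_rng(0)
n_subjects, n_landmarks = 500, 68
source = rng.normal(size=(n_subjects, n_landmarks, 2))        # stand-in for FG-NET landmark data
ages = rng.uniform(0, 69, size=n_subjects)                    # stand-in for the age labels

# Target domain: the same landmarks rotated by a random angle in [0, pi/4].
angles = rng.uniform(0, np.pi / 4, size=n_subjects)
target = np.stack([rotate_landmarks(P, a) for P, a in zip(source, angles)])

# Work with sqrt(age) as the response, and threshold at the source median to obtain
# the two classes used to estimate the RLD transformation.
resp = np.sqrt(ages)
classes = (resp > np.median(resp)).astype(int)

# Flatten landmark configurations into feature vectors (the paper additionally maps them
# to a tangent space via the inverse exponential map, which is omitted here).
X = source.reshape(n_subjects, -1)
Z = target.reshape(n_subjects, -1)

def mean_absolute_error(pred_sqrt_age, true_age):
    return np.mean(np.abs(pred_sqrt_age ** 2 - true_age))

pred = np.full(n_subjects, resp.mean())                       # trivial constant predictor
print(X.shape, Z.shape, np.bincount(classes), mean_absolute_error(pred, ages))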
V. CONCLUSION

When the distribution of features is heterogeneous across training and testing sets, modifications can be made to aid in prediction. We outlined two different paradigms: instance-weighting methods, in which each observation is assigned a weight to even out distributional differences, and dimension reduction methods, in which feature spaces are sought that minimize the discrepancy between training and testing sets. There have been many approaches within each of these frameworks. RLD works well when the underlying model is linear, and seems to perform best overall on a structured regression problem, with IS and GFK performing similarly well, at least when source and target dimensions are the same. In any case, using some method to account for a domain shift can go a long way toward improving predictive performance in pattern recognition tasks.

ACKNOWLEDGMENT

This work was partially supported by a MURI grant from the Office of Naval Research.

REFERENCES

[1] FG-NET Aging Database, Face and Gesture Recognition Research Network, accessed Apr. 2011.
[2] C. Goodall and K. Mardia, "Projective Shape Analysis," Journal of Computational and Graphical Statistics, vol. 8, no. 2.
[3] P. Turaga, S. Biswas, and R. Chellappa, "The Role of Geometry for Age Estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing, March 2010.
[4] H. Daumé III and D. Marcu, "Domain adaptation for statistical classifiers," Journal of Artificial Intelligence Research, vol. 26, no. 1, 2006.
[5] H. Shimodaira, "Improving Predictive Inference Under Covariate Shift by Weighting the Log-Likelihood Function," Journal of Statistical Planning and Inference, vol. 90, no. 2, Oct. 2000.
[6] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, 2010.
[7] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe, "Direct importance estimation for covariate shift adaptation," Annals of the Institute of Statistical Mathematics, vol. 60, no. 4, 2008.
[8] J. Huang, A. J. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf, "Correcting sample selection bias by unlabeled data," Advances in Neural Information Processing Systems, vol. 19.
[9] Y. Nesterov and A. Nemirovskii, Interior-Point Polynomial Algorithms in Convex Programming. Philadelphia, PA: SIAM Studies in Applied and Numerical Mathematics, 1994.
[10] Y. Chikuse, Statistics on Special Manifolds, ser. Lecture Notes in Statistics. Springer.
[11] R. Gopalan, R. Li, and R. Chellappa, "Domain Adaptation for Object Recognition: An Unsupervised Approach," International Conference on Computer Vision, 2011.

[12] H. Wold, "Partial Least Squares," in Encyclopedia of Statistical Sciences, S. Kotz and N. Johnson, Eds. New York: Wiley.
[13] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[14] D. A. Shaw and R. Chellappa, "Sufficient dimension reduction for domain adaptation," 2014, under review.
[15] R. Cook and K. Adragni, "Sufficient dimension reduction and prediction in regression," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 367, no. 1906, 2009.
[16] D. A. Shaw and R. Chellappa, "A Dimension Reduction Method for Improving Predictive Performance in Regression and Classification," Journal of the American Statistical Association, 2014, under review.
[17] R. D. Cook and L. Forzani, "Likelihood-based sufficient dimension reduction," Journal of the American Statistical Association, vol. 104, no. 485, 2009.
[18] A. Edelman, T. Arias, and S. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, 1998.
[19] D. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. New York, Chichester: John Wiley & Sons.
[20] G. H. Golub, M. Heath, and G. Wahba, "Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter," Technometrics, vol. 21, no. 2, May 1979.
[21] C. R. Loader, "Bandwidth Selection: Classical or Plug-In?" The Annals of Statistics, vol. 27, no. 2, 1999.
[22] R. Lippert and A. Edelman, SGMIN, bai/et/other methods/overview SGMIN.html.
[23] K. Gallivan, A. Srivastava, X. Liu, and P. Van Dooren, "Efficient algorithms for inferences on Grassmann manifolds," in Proceedings of the 12th IEEE Workshop on Statistical Signal Processing, 2003.


More information

CS 7140: Advanced Machine Learning

CS 7140: Advanced Machine Learning Instructor CS 714: Advanced Machine Learning Lecture 3: Gaussian Processes (17 Jan, 218) Jan-Willem van de Meent (j.vandemeent@northeastern.edu) Scribes Mo Han (han.m@husky.neu.edu) Guillem Reus Muns (reusmuns.g@husky.neu.edu)

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning Hilary Term 2018 Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

GWAS V: Gaussian processes

GWAS V: Gaussian processes GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011

More information