Appearing in the NIPS 2005 workshop "Foundations of Active Learning", Whistler, Canada, December 2005.

Transductive Experiment Design

Kai Yu, Jinbo Bi, Volker Tresp
Siemens AG, Munich, Germany

Abstract

This paper considers the problem of selecting the most informative experiments $x$ at which to take measurements $y$ for learning an inference model $y = f(x)$. We propose a novel concept for active learning, transductive experiment design, to overcome the shortcomings of existing experiment design methods, e.g. insufficient exploration of the available unmeasured data and poor scalability to large data sets. An in-depth analysis shows that the method favors experiments that are hard to predict and at the same time representative of the remaining hard-to-predict data. Efficient solutions are developed through mathematical programming techniques. Encouraging results on toy problems and real-world data sets highlight the advantages of the proposed approaches.

1 Experiment Design

The problem of active learning is often referred to as experiment design in statistics (see [1, 2]). Formally, in order to learn a function $f(x) = w^\top x$, $w \in \mathbb{R}^d$, one takes measurements, or experiments, $y_i = w^\top x_i + \epsilon_i$, $i = 1, \dots, m$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ and hence $y_i \sim \mathcal{N}(w^\top x_i, \sigma^2)$. The experiments $x_1, \dots, x_m$ are chosen among $n$ possible test data $v_1, \dots, v_n \in \mathbb{R}^d$, $n > m$. The goal of experiment design is to choose the $m$ vectors $x_i$ from among the possible choices so that the estimation error is small; in other words, the task is to find a set of data $x_i$ that together are maximally informative. The maximum-likelihood estimate of $w$ is
$$\hat{w} = \arg\min_w \sum_{i=1}^m (w^\top x_i - y_i)^2.$$
The estimation error $\hat{w} - w$ has zero mean and covariance matrix $C_w = \sigma^2 (X^\top X)^{-1}$ [2].^1 The matrix $C_w$ characterizes the accuracy of the estimation, or the informativeness of the experiments. Let $m_j$ denote the number of experiments for which $v_j$ is chosen in $X$, where $m_1 + \dots + m_n = m$. The so-called A-optimal design minimizes the trace of $C_w$, namely
$$\text{minimize} \quad \mathrm{Tr}\Big[\Big(\sum_{j=1}^n m_j v_j v_j^\top\Big)^{-1}\Big] \quad \text{subject to} \quad m_j \ge 0, \; m_1 + \dots + m_n = m, \; m_j \in \mathbb{Z},$$
where $\sigma^2$ is dropped from the objective since it is a constant. The integer constraints on the $m_j$ can be relaxed.

^1 In the rest of this paper we use $X$ to represent both the matrix $[x_1, \dots, x_m] \in \mathbb{R}^{m \times d}$ and the index set $\{x_i\}$, and $V$ to represent both $[v_1, \dots, v_n] \in \mathbb{R}^{n \times d}$ and the index set $\{v_i\}$; the meaning will be clear from the context.
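As a concrete illustration (not part of the original paper), the following minimal NumPy sketch evaluates the classical A-optimality criterion $\mathrm{Tr}[C_w]$ for a concrete choice of experiments $X$, rather than the relaxed weights $m_j$; the pool size, dimensionality, and noise level are arbitrary illustrative choices.

```python
import numpy as np

def a_optimality(X, sigma2=1.0):
    """Classical A-optimality criterion Tr[C_w] = sigma^2 * Tr[(X^T X)^{-1}].

    X must contain at least d linearly independent experiments, otherwise
    X^T X is singular -- one of the shortcomings discussed in Section 2.
    """
    return sigma2 * np.trace(np.linalg.inv(X.T @ X))

# Compare two candidate designs drawn from a pool of test points V.
rng = np.random.default_rng(0)
V = rng.normal(size=(100, 4))            # n = 100 candidates in R^4
print(a_optimality(V[:6]), a_optimality(V[-6:]))
```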

There are also other variants of experiment design, such as D-optimal and E-optimal design. All of them are semidefinite programming (SDP) problems and can be interpreted as finding a minimum ellipsoid that encloses the data $V$; they differ only in how the size of the ellipsoid is measured.

2 Transductive Experiment Design

Classical experiment design^2 has several shortcomings. First, optimization criteria based on $C_w$ are only indirect indicators of the quality of the learned function: since the function will be used to make predictions on future test data, it is more desirable to assess the prediction quality on test data directly. Second, minimizing the variance of $w$ amounts to minimizing the variance of $w^\top x$ over the entire input space; if one is only interested in predicting well on non-uniformly distributed data, this is unnecessary or even harmful. Third, the number of experiments is implicitly required to be no less than the dimensionality of the inputs $x_i$, because when $m < d$ the matrix $X^\top X$ is not invertible; this becomes serious when the input dimensionality is in the thousands but the budget affords only a few experiments. Finally, classical experiment design has to solve a semidefinite program, which is often very slow already for hundreds of data points. To overcome these shortcomings, we perform experiment design in a transductive setting, where the focus is the predictive performance on test data given beforehand.

^2 In the rest of this paper we refer to existing experiment design methods as classical experiment design.

2.1 General Transductive Experiment Design

A general setting may consider a set $T$ of test points different from the experiment candidates $V$. For simplicity, and without loss of generality, we assume the two sets are the same. Let $w$ a priori follow a Gaussian distribution $\mathcal{N}(0, \nu^2 I)$, where $I$ is the $d \times d$ identity matrix. Based on the training examples $\{x_i, y_i\}$ observed from experiments $y_i = f(x_i) + \epsilon_i$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, the function weights $w$ are estimated by
$$\min_w \; \sum_{i=1}^m (w^\top x_i - y_i)^2 + \mu \|w\|^2,$$
where $\mu = \sigma^2 / \nu^2 > 0$ and $\|\cdot\|$ is the vector 2-norm. Following a similar procedure as before, the covariance of the estimation error of $w$ is $C_w = \sigma^2 (X^\top X + \mu I)^{-1}$, where $X^\top X + \mu I$ is always full rank. Let $f = [f(v_1), \dots, f(v_n)]^\top$ be the function values on all the available data $V$; the prediction error then has covariance matrix
$$C_f = E\big[(f - \hat{f})(f - \hat{f})^\top\big] = V C_w V^\top.$$
In contrast to $C_w$ in the classical design, $C_f$ directly characterizes the quality of the predictions on the target data. Applying the Woodbury inversion identity, minimizing the trace of $C_f$ can be formulated as
$$\text{maximize} \quad \mathrm{Tr}\big[V X^\top (X X^\top + \mu I)^{-1} X V^\top\big] \qquad (1)$$
$$\text{subject to} \quad X \subset V, \quad |X| = m,$$
where $|X| = m$ restricts the number of chosen experiments to $m$. The matrix $V X^\top (X X^\top + \mu I)^{-1} X V^\top$ plays a key role in the following discussion.
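To make objective (1) concrete, here is a small brute-force sketch of ours (not one of the algorithms proposed in this paper) that evaluates $\mathrm{Tr}[V X^\top (X X^\top + \mu I)^{-1} X V^\top]$ for every size-$m$ subset of a tiny candidate pool; the scalable LP and greedy solutions are the subject of Section 3.

```python
import numpy as np
from itertools import combinations

def transductive_objective(V, idx, mu=1.0):
    """Objective (1): Tr[V X^T (X X^T + mu I)^{-1} X V^T] with X = V[idx]."""
    X = V[list(idx)]
    M = np.linalg.inv(X @ X.T + mu * np.eye(len(idx)))
    return np.trace(V @ X.T @ M @ X @ V.T)

def exhaustive_design(V, m, mu=1.0):
    """Exact maximizer of (1) by enumerating all size-m subsets.
    Only feasible for tiny pools; a didactic baseline, not a
    practical algorithm."""
    n = V.shape[0]
    return max(combinations(range(n), m),
               key=lambda idx: transductive_objective(V, idx, mu))

rng = np.random.default_rng(0)
V = rng.normal(size=(20, 3))             # 20 candidates in R^3
print(exhaustive_design(V, m=3))
```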

Theorem 2.1 Let $\chi(\cdot)$ denote the projection onto the subspace of $\mathbb{R}^d$ that is the orthogonal complement of $\mathrm{span}(x_1, \dots, x_m)$. The objective function in (1) is equivalent to
$$\sum_i \|\phi(v_i)\|^2 = \sum_i \big( \|v_i\|^2 - \|\chi(v_i)\|^2 - \|\psi(v_i)\|^2 \big), \qquad (2)$$
with $\psi(v) = \sum_{j=1}^m \frac{\mu}{\lambda_j + \mu}\, h_j h_j^\top v$ and $\phi(v) = \sum_{j=1}^m \frac{\lambda_j}{\lambda_j + \mu}\, h_j h_j^\top v$, where $h_1, \dots, h_m$ and $\lambda_1 \ge \dots \ge \lambda_m \ge 0$ are respectively the eigenvectors and eigenvalues of $X^\top X$.

Due to space limitations we omit all proofs; details are given in a longer version [5]. We offer a few comments on transductive experiment design:

- Since $\|\phi(v_i)\|^2$ is upper bounded by $\|v_i\|^2$, candidates $v_i$ with a larger norm $\|v_i\|$ have a higher potential to produce a large $\|\phi(v_i)\|^2$, so the selected experiments $X$ forming $\phi(\cdot)$ should include such $v_i$. Because $E[f(v_i)^2] = v_i^\top E[w w^\top] v_i = \nu^2 \|v_i\|^2$, the norm $\|v_i\|^2$ encodes the prior uncertainty of the function at $v_i$: transductive experiment design tends to select experiments with uncertain outcomes.

- On the other hand, maximizing $\mathrm{Tr}(Z Z^\top)$ indicates that $\sum_{i=1}^n \|\chi(v_i)\|^2$ should be small, i.e. the projections of $V$ onto the orthogonal complement subspace should be as small as possible. Essentially, the optimization seeks the set $X$ of experiments that retains as much of the information of $V$ as possible in $\mathrm{span}(x_1, \dots, x_m)$: transductive experiment design tends to select experiments that are representative of $V$.

- Due to the regularization, minimizing $\|\psi(v_i)\|^2$ implies that $V$ should be strongly correlated with the leading eigenvectors of $X^\top X$: transductive experiment design tends to select experiments $X$ whose significant patterns capture the information of $V$.

Transductive experiment design combines these three criteria in a unified framework. In some sense, classical experiment design considers only the first criterion, since it picks experiments that lie on the surface of the minimum-volume ellipsoid and are thus far from the origin (i.e. have a large norm). The key contributor to the second criterion is the idea of transduction, namely focusing only on predictions for the target cases $V$. The third criterion can be seen as a refinement of the second one caused by the effect of regularization. Following the terminology of classical experiment design, we call problem (1) the A-optimal transductive design.^3 We are now ready to handle experiment design with nonlinear functions by introducing the kernelized version (see details in [5]).

^3 One can also consider other variants of the objective function, such as minimizing the 2-norm of $C_f$ (see [5]). In this paper we mainly focus on the A-optimal transductive design.
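The kernelized version is only referenced here, with details deferred to [5]. As a hedged sketch of what it looks like (our assumption, based on the standard kernel substitution and the kernel quantities $K = V V^\top$ and $K_{vx} = V X^\top$ that reappear in Section 3), replacing the inner products in (1) by kernel evaluations gives the objective computed below; the RBF kernel and its width are our illustrative choices, not the paper's.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_transductive_objective(V, idx, mu=1.0, gamma=1.0):
    """Kernelized analogue of objective (1):
    Tr[K_vx (K_xx + mu I)^{-1} K_vx^T], a sketch of the nonlinear
    extension whose details are given in [5]."""
    X = V[list(idx)]
    K_vx = rbf_kernel(V, X, gamma)
    K_xx = rbf_kernel(X, X, gamma)
    M = np.linalg.inv(K_xx + mu * np.eye(len(idx)))
    return np.trace(K_vx @ M @ K_vx.T)

rng = np.random.default_rng(0)
V = rng.normal(size=(30, 2))
print(kernel_transductive_objective(V, idx=[0, 5, 9], mu=0.5, gamma=0.5))
```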

3 A-optimal Transductive Experiment Design

Various design strategies can be employed to conduct transductive experiment design. We give examples of how we establish A-optimal design solutions; please consult [5] for a more complete discussion. A-optimal design requires solving a difficult combinatorial optimization problem when $m > 1$. Fortunately, it has an equivalent formulation:
$$\text{minimize} \quad \sum_{i=1}^n \big( \pi_i \|q_i - K_{vx} c_i\|^2 + \mu \pi_i \|c_i\|^2 \big) \qquad (3)$$
$$\text{subject to} \quad X \subset V, \quad |X| = m, \quad C = [c_1, \dots, c_n] \in \mathbb{R}^{n \times m},$$
where $Q = [q_1, \dots, q_n]$ and $\pi_1, \dots, \pi_n$ are the eigenvectors and eigenvalues of $K = V V^\top$, and $K_{vx} = V X^\top$. Clearly, A-optimal design seeks a subset of $m$ experiments $X$ from which the best approximation of the leading eigenvectors of $K$ can be constructed. Instead of minimizing the quadratic loss as in Eq. (3), many recent works [3, 6] have shown that the absolute-deviation loss together with 1-norm regularization is equally suitable for learning inference models, and the resulting linear programs can be solved efficiently. We therefore design a novel algorithm that approximates the leading eigenvectors of $K$ in terms of the absolute-deviation loss and 1-norm regularization, which enhances scalability:
$$\min_{\beta \ge 0} \; \min_{\alpha_i} \; \sum_{i=1}^n \big( \pi_i \|q_i - K B \alpha_i\|_1 + \mu \pi_i \|B \alpha_i\|_1 \big) + \gamma \|\beta\|_1, \qquad (4)$$
where $B$ is an $n \times n$ diagonal matrix whose $j$-th diagonal element $\beta_j \in \{0, 1\}$ indicates whether the corresponding experiment appears in $X$. An alternating optimization procedure yields an iterative algorithm with two major steps per iteration. The first step fixes $B$, converts $K \leftarrow K B$, and solves the following problem for the optimal $\alpha_i$:
$$\min_{\alpha_i, \xi_i, s_i} \; \sum_{i=1}^n \big( e^\top \xi_i + \mu \pi_i \beta^\top s_i \big) \quad \text{s.t.} \quad \pi_i q_i - K \alpha_i \le \xi_i, \;\; K \alpha_i - \pi_i q_i \le \xi_i, \;\; -s_i \le \alpha_i \le s_i, \;\; i = 1, \dots, n. \qquad (5)$$
The second step fixes the $\alpha_i$ at the above solution, converts $K_i \leftarrow K\, \mathrm{diag}(\alpha_i)$, and solves the following problem for the optimal $\hat{\beta}$:
$$\min_{\beta, \xi_i} \; \sum_{i=1}^n e^\top \xi_i + \gamma e^\top \beta \quad \text{s.t.} \quad \pi_i q_i - K_i \beta \le \xi_i, \;\; K_i \beta - \pi_i q_i \le \xi_i, \;\; i = 1, \dots, n, \;\; \beta \ge 0. \qquad (6)$$
Note that both problems (5) and (6) are linear programs (LP) and can be solved efficiently; we therefore call this the LP A-optimal algorithm. Further, problem (5) can be decoupled to optimize each $\alpha_i$ separately by minimizing $e^\top \xi_i + \mu \pi_i \beta^\top s_i$ subject to $\pi_i q_i - K \alpha_i \le \xi_i$, $K \alpha_i - \pi_i q_i \le \xi_i$, $-s_i \le \alpha_i \le s_i$. These $n$ subproblems are very small and hence scalable.

We also derive a greedy algorithm that sequentially selects $m$ experiments, based on the following result. Let the experiments $X$ be formed by two disjoint sets $X_1, X_2 \subset V$, and let $C_{f|X}$ denote the predictive covariance matrix $C_f$ given $X$. Then
$$\frac{1}{\nu^2} C_{f|X} = K^{(1)} - K^{(1)}_{v,x} \big(K^{(1)}_{x,x} + \mu I\big)^{-1} K^{(1)}_{x,v}, \qquad (7)$$
where $K^{(1)} = \frac{1}{\nu^2} C_{f|X_1}$ and $x$ indexes $X_2$. The sequential A-optimal algorithm repeats the following two steps until $m$ experiments have been selected: (1) select the $x \in V$ with the highest $\|k(x, \cdot)\|^2 / k(x, x)$ and add $x$ to $X$, where $k(x, \cdot)$ is the column of the current $K$ corresponding to $x$; (2) update $K \leftarrow K - K_{v,x}(K_{x,x} + \lambda I)^{-1} K_{x,v}$ based on the current $X$. As in other scenarios, this greedy approximation demonstrated high efficiency in our empirical study.
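The sequential algorithm above is simple to implement. Below is a minimal NumPy sketch of the two steps as stated; the variable names, the linear-kernel example, and the small jitter added to the diagonal are our choices, and since one point is added at a time the matrix inverse reduces to the scalar $K_{x,x} + \lambda$.

```python
import numpy as np

def sequential_transductive_design(K, m, lam=1.0):
    """Greedy sequential A-optimal transductive design (a sketch).

    K   : (n, n) kernel matrix over the candidate pool V
    m   : number of experiments to select
    lam : regularization parameter (lambda / mu in the text)
    """
    K = K.copy()
    selected = []
    for _ in range(m):
        # Step 1: pick x in V with the highest ||k(x, .)||^2 / k(x, x).
        scores = (K ** 2).sum(axis=0) / (np.diag(K) + 1e-12)
        scores[selected] = -np.inf            # never re-select a point
        x = int(np.argmax(scores))
        selected.append(x)
        # Step 2: update K <- K - K_{v,x} (K_{x,x} + lam)^{-1} K_{x,v}.
        kx = K[:, x:x + 1].copy()
        K -= kx @ kx.T / (K[x, x] + lam)
    return selected

# Example usage with a linear kernel K = V V^T.
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 5))
print(sequential_transductive_design(V @ V.T, m=4))
```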

4 Experiments

4.1 Toy Problem: Four Gaussians

We generated a toy problem with four Gaussian components in a 2-dimensional space, as shown in Fig. 1(a), and tested experiment design with $m = 4$. Classical experiment design attempts to reduce the predictive variance over the entire input space: in Fig. 1(b) the shaded contours produced by the selected experiments (the darker, the lower the variance) cover a large region of the input space where no data exist. In contrast, as shown in Fig. 1(c) and (d), the two variants of the A-optimal transductive design approach, Algorithms 3 and 2, concentrate on reducing the predictive variance at the targeted data and thus select points close to the cluster centers.

[Figure 1: Experimental design (m = 4) on a toy problem with four Gaussian components. Panels: (a) the data, (b) classical design, (c) sequential transductive design, (d) transductive design. The large red triangle markers indicate the selected data points; gray levels and contours indicate the predictive variance of the learnt function over the entire input space (darker means lower variance). Both transductive design methods give better results than the classical design.]

4.2 Text Categorization: Newsgroup and RCV1 Data Sets

In this subsection we validate the proposed experiment design approaches on supervised text categorization, using two data sets, the Newsgroup corpus and the RCV1 corpus. We solved two-class classification problems in a one-against-all scheme for each category. The data points selected by our transductive design, together with their labels (+1 or -1), are used by kernel ridge regression with a linear kernel; this learning model has shown state-of-the-art performance for text categorization. We also examined random sampling for linear ridge regression and active learning with SVMs. The SVM used in this study is the algorithm described in [4], which selects the data points closest to the decision boundary. We could not run classical A-optimal design on the text categorization problems, since our SDP solver does not scale to problems of this size.

The results are shown in Fig. 2. The two transductive experiment design methods consistently and significantly outperform the other methods on both data sets. For example, with just 10 selected training examples they achieved mean AUC scores of 90.2% on Newsgroup and 74.0% on RCV1, compared with 77.0% and 64.9% for random sampling. The error bars of the transductive designs are also much smaller than those of the compared methods. Furthermore, on both data sets the non-sequential transductive design outperforms the sequential one, confirming that the sequential greedy solution is less optimal than the non-sequential version; the advantage is particularly apparent on the Newsgroup data.

[Figure 2: Text categorization accuracy (AUC score) based on training data selected by different methods, on (left) the Newsgroup data set and (right) the RCV1 data set.]

Interestingly, active learning using SVMs performs worse than random sampling on Newsgroup, while it outperforms random sampling on RCV1. This can be explained by the fact that Newsgroup exhibits a clear clustering structure. SVM active learning tends to select data points near the classification boundary, which easily picks outliers when a strong cluster structure exists. In other words, SVM active learning may be unsuitable for data sets, like Newsgroup, that have a strong cluster structure.

References

[1] Atkinson, A. C. and Donev, A. N. Optimum Experimental Designs. Oxford Statistical Science Series. Oxford University Press, 1992.
[2] Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
[3] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[4] Tong, S. Active Learning: Theory and Applications. Ph.D. thesis, Stanford University, 2001.
[5] Yu, K., Bi, J., and Tresp, V. Active Learning via Transductive Experiment Design. Siemens AG, submitted.
[6] Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. 1-norm support vector machines. In S. Thrun, L. Saul, and B. Schölkopf, eds., Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
