ECS289: Scalable Machine Learning
1 ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016
2 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels
3 Multiclass Learning: n data points, L labels, d features. Input: training data {(x_i, y_i)}_{i=1}^n, where each x_i is a d-dimensional feature vector and each y_i ∈ {1, ..., L} is the corresponding label; each training point belongs to exactly one category. Goal: find a function f such that f(x) ≈ y, i.e., f predicts the correct label.
4 Multi-label Problems: n data points, L labels, d features. Input: training data {(x_i, y_i)}_{i=1}^n, where each x_i is a d-dimensional feature vector and each y_i ∈ {0, 1}^L is a label vector (equivalently, a label set Y_i ⊆ {1, 2, ..., L}). Example: y_i = [0, 0, 1, 0, 0, 1, 1] corresponds to Y_i = {3, 6, 7}. Each training point can belong to multiple categories. Goal: given a test sample x, predict the correct set of labels.
5 Illustration: in the label matrix Y, multiclass means each row has exactly one 1; multilabel means each row can have multiple 1s.
6-7 Measure the accuracy (multi-class): Let {(x_i, y_i)}_{i=1}^m be a set of test data for a multi-class problem, and let z_1, ..., z_m be the predictions. Accuracy (I(·) is the indicator function): (1/m) Σ_{i=1}^m I(y_i = z_i). If the algorithm instead outputs a set Z_1, ..., Z_m of k potential labels for each sample, Precision@k is (1/m) Σ_{i=1}^m I(y_i ∈ Z_i).
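The two multi-class metrics above can be sketched in a few lines of numpy; the function names and toy data are mine, not from the slides:

```python
import numpy as np

def accuracy(y_true, z_pred):
    """Fraction of test points whose single predicted label is correct."""
    return float(np.mean(np.asarray(y_true) == np.asarray(z_pred)))

def precision_at_k(y_true, Z_pred):
    """Fraction of test points whose true label appears in the k-label set Z_i."""
    return sum(y in Z for y, Z in zip(y_true, Z_pred)) / len(y_true)

# toy check
y = [1, 2, 3, 1]
z = [1, 2, 1, 1]                        # single-label predictions
Z = [{1, 2}, {2, 3}, {1, 2}, {3, 1}]    # k = 2 candidate sets
print(accuracy(y, z))        # 0.75 (3 of 4 correct)
print(precision_at_k(y, Z))  # 0.75 (true label inside Z_i for 3 of 4)
```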
8-11 Example (multiclass): worked examples of accuracy and Precision@k (figure slides; not transcribed).
12 Measure the accuracy (multi-label): Let {(x_i, y_i)}_{i=1}^m be a set of test data for a multi-label problem. For each test point the classifier outputs z_i ∈ {0, 1}^L. Write Y_i = {t : (y_i)_t ≠ 0} and Z_i = {t : (z_i)_t ≠ 0} for the sets of true and predicted labels of the i-th test point. If each Z_i has k elements, Precision@k is (1/m) Σ_{i=1}^m |Z_i ∩ Y_i| / k. Hamming loss measures overall classification performance: (1/m) Σ_{i=1}^m d(y_i, z_i), where d(y_i, z_i) counts the positions where y_i and z_i differ. One can also draw an ROC curve (assuming the classifier predicts a ranking of labels for each data point).
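The two multi-label metrics can be sketched directly from their definitions; the toy labels below are mine, chosen only to make the arithmetic easy to check by hand:

```python
import numpy as np

def precision_at_k(Y_true, Z_pred, k):
    """Mean over i of |Z_i ∩ Y_i| / k, with Z_i the k predicted labels."""
    return float(np.mean([len(Z & Y) / k for Y, Z in zip(Y_true, Z_pred)]))

def hamming_loss(Y, Z):
    """Average number of label positions where the 0/1 vectors differ."""
    Y, Z = np.asarray(Y), np.asarray(Z)
    return float(np.mean(np.sum(Y != Z, axis=1)))

Y_sets = [{3, 6}, {1}]
Z_sets = [{3, 7}, {1, 2}]
print(precision_at_k(Y_sets, Z_sets, k=2))  # (1/2 + 1/2) / 2 = 0.5

y_vec = [[0, 0, 1, 0], [1, 0, 0, 0]]
z_vec = [[0, 1, 1, 0], [1, 0, 0, 1]]
print(hamming_loss(y_vec, z_vec))           # (1 + 1) / 2 = 1.0
```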
13-15 Example (multilabel): worked examples of Precision@k and Hamming loss (figure slides; not transcribed).
16 Traditional Approach: many algorithms exist for binary classification. Idea: transform a multi-class or multi-label problem into multiple binary classification problems. Two approaches: One versus All (OVA) and One versus One (OVO).
17 One Versus All (OVA): for a multi-class/multi-label problem with L categories, build L binary classifiers. For the t-th classifier: positive samples are all the points in class t ({x_i : t ∈ y_i}); negative samples are all the points not in class t ({x_i : t ∉ y_i}). Let f_t(x) be the decision value of the t-th classifier (larger f_t(x) means higher probability that x is in class t). Prediction: f(x) = argmax_t f_t(x). Example: use SVM to train each binary classifier.
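A minimal OVA sketch on toy data. The slides use SVM as the base binary classifier; here a regularized least-squares scorer stands in for it (my assumption, chosen so the example needs nothing beyond numpy), but the train-L-scorers / argmax structure is the same:

```python
import numpy as np

def ova_train(X, y, L, lam=1e-3):
    """One scorer f_t per class: +1 targets for class t, -1 for the rest."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias feature
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    W = np.zeros((L, Xb.shape[1]))
    for t in range(L):
        b = np.where(y == t, 1.0, -1.0)             # class t vs. the rest
        W[t] = np.linalg.solve(A, Xb.T @ b)         # f_t(x) = w_t^T [x; 1]
    return W

def ova_predict(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xb @ W.T, axis=1)              # f(x) = argmax_t f_t(x)

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])
y = np.repeat(np.arange(3), 30)
acc = np.mean(ova_predict(ova_train(X, y, L=3), X) == y)
print(acc)  # well-separated clusters, so training accuracy is high
```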
18 One Versus One (OVO): for a problem with L categories, build L(L-1)/2 binary classifiers, one per pair of labels. For the (s, t)-th classifier: positive samples are all the points in class s ({x_i : s ∈ y_i}); negative samples are all the points in class t ({x_i : t ∈ y_i}). Let f_{s,t}(x) be its decision value (larger f_{s,t}(x) means label s is more likely than label t), with f_{t,s}(x) = -f_{s,t}(x). Prediction: f(x) = argmax_s ( Σ_{t≠s} f_{s,t}(x) ). Example: use SVM to train each binary classifier. OVO is mainly used for multi-class problems.
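An OVO sketch with one pairwise scorer f_{s,t} per label pair, aggregated by voting. The pairwise scorer here is a nearest-class-mean separator rather than the SVM the slides use (my simplification, so the example stays self-contained):

```python
import numpy as np

def ovo_predict(X, y, L, X_test):
    """Train a scorer per pair (s, t); each test point votes in every pair."""
    votes = np.zeros((X_test.shape[0], L), dtype=int)
    for s in range(L):
        for t in range(s + 1, L):
            mu_s = X[y == s].mean(axis=0)
            mu_t = X[y == t].mean(axis=0)
            # f_{s,t}(x) > 0  <=>  x is closer to class s's mean
            f = X_test @ (mu_s - mu_t) - (mu_s @ mu_s - mu_t @ mu_t) / 2
            votes[f > 0, s] += 1
            votes[f <= 0, t] += 1
    return np.argmax(votes, axis=1)                 # label with most votes

rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])
y = np.repeat(np.arange(3), 30)
acc = np.mean(ovo_predict(X, y, 3, X) == y)
print(acc)
```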
19-21 OVA vs OVO. Prediction accuracy: depends on the dataset. Computational cost: OVA trains L classifiers, OVO trains L(L-1)/2 classifiers. Is OVA always faster than OVO? No; it depends on the time complexity of the binary solver. If the binary classifier takes O(n) time for n samples, OVA and OVO have similar total complexity. If it takes superlinear time, say O(n^{1.xx}), OVO is faster than OVA, since each pairwise problem involves only the samples of two classes. In practice: LIBSVM (kernel SVM solver) uses OVO; LIBLINEAR (linear SVM solver) uses OVA.
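The complexity claim above can be checked with back-of-envelope arithmetic, assuming the n training points are spread evenly over the L classes and a binary solver costing n^c for n points (both of these are my illustrative assumptions):

```python
# Total training cost of OVA vs OVO under a solver costing n**c per problem.
def total_cost(n, L, c):
    ova = L * n**c                            # L problems over all n points
    ovo = L * (L - 1) / 2 * (2 * n / L)**c    # each pair sees ~2n/L points
    return ova, ovo

n, L = 10**6, 100
ova1, ovo1 = total_cost(n, L, 1.0)   # linear solver: totals are comparable
ova2, ovo2 = total_cost(n, L, 2.0)   # quadratic solver: OVO ~L/2 times cheaper
print(ova1 / ovo1, ova2 / ovo2)
```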
22 Comparisons (accuracy): kernel SVM with RBF kernel (see "A comparison of methods for multiclass support vector machines", 2002; result tables not transcribed).
23 Comparisons (training time): kernel SVM with RBF kernel (same reference; tables not transcribed).
24 Methods Using Instance-wise Ranking Loss
25 Main idea: OVA and OVO decompose the problem by labels, but good binary classifiers do not necessarily give good multi-class predictions: because the labels are handled independently, these decompositions cannot capture ranking information between labels. Ranking-based approaches instead directly minimize a ranking loss: for multiclass classification, the score of the true label y_i should be larger than the scores of all other labels; for multilabel classification, the scores of the labels in Y_i should be larger than those of the other labels. This gives one combined optimization problem that minimizes the ranking loss.
26 Main idea: for simplicity, assume a linear model with parameters w_1, ..., w_L. For a data point x, compute the decision value for each label: w_1^T x, w_2^T x, ..., w_L^T x. Prediction: y = argmax_t w_t^T x. For a training point (x_i, y_i) with true label y_i, we therefore want y_i = argmax_t w_t^T x_i.
27 Main idea. Question: how to define the distance between XW and Y?
28 Weston-Watkins Formulation. Proposed in Weston and Watkins, "Multi-class support vector machines", in ESANN: min_{ {w_t}, {ξ_i^t} } (1/2) Σ_{t=1}^L ||w_t||^2 + C Σ_{i=1}^n Σ_{t≠y_i} ξ_i^t, s.t. w_{y_i}^T x_i - w_t^T x_i ≥ 1 - ξ_i^t and ξ_i^t ≥ 0 for all t ≠ y_i, i = 1, ..., n. If point i is in class y_i, then for every other label t ≠ y_i we want w_{y_i}^T x_i - w_t^T x_i ≥ 1, or we pay a penalty ξ_i^t. Prediction: f(x) = argmax_t w_t^T x.
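Evaluating the Weston-Watkins primal objective for a given W makes the per-label slacks concrete; this is a sketch of the objective only, not of the dual solver the slides describe (function name and toy numbers are mine):

```python
import numpy as np

def ww_loss(W, X, y, C=1.0):
    """(1/2) Σ_t ||w_t||^2 + C Σ_i Σ_{t≠y_i} max(0, 1 - (w_{y_i} - w_t)^T x_i)."""
    scores = X @ W.T                                   # n x L decision values
    margins = scores[np.arange(len(y)), y][:, None] - scores
    slack = np.maximum(0.0, 1.0 - margins)             # xi_i^t per pair (i, t)
    slack[np.arange(len(y)), y] = 0.0                  # no slack for t = y_i
    return 0.5 * np.sum(W * W) + C * slack.sum()

W = np.array([[1.0, 0.0], [0.0, 1.0]])                 # 2 labels, 2 features
X = np.array([[1.0, 0.0]])
y = np.array([0])
print(ww_loss(W, X, y))  # margin to the wrong label is exactly 1, so no
                         # slack; only the regularizer 0.5 * ||W||^2 = 1.0
```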
29 Weston-Watkins Formulation (dual). The dual problem of the Weston-Watkins formulation: min_{α} (1/2) Σ_{t=1}^L ||w_t(α)||^2 + Σ_{i=1}^n Σ_{t≠y_i} α_i^t, s.t. -C ≤ α_i^t ≤ 0 for all t ≠ y_i, i = 1, ..., n, where α^1, ..., α^L ∈ R^n are the dual variables for each label, w_t(α) = Σ_i α_i^t x_i, and α_i^{y_i} = -Σ_{t≠y_i} α_i^t. Can be solved by dual (block) coordinate descent.
30 Crammer-Singer Formulation. Proposed in Crammer and Singer, "On the algorithmic implementation of multiclass kernel-based vector machines", in JMLR: min_{ {w_t}, {ξ_i} } (1/2) Σ_{t=1}^L ||w_t||^2 + C Σ_{i=1}^n ξ_i, s.t. w_{y_i}^T x_i - w_t^T x_i ≥ 1 - ξ_i and ξ_i ≥ 0 for all t ≠ y_i, i = 1, ..., n. As in Weston-Watkins, if point i is in class y_i we want w_{y_i}^T x_i - w_t^T x_i ≥ 1 for every other label t ≠ y_i, but each point i pays only its largest violation: one slack ξ_i per point instead of one per wrong label. Prediction: f(x) = argmax_t w_t^T x.
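The contrast with Weston-Watkins is easiest to see in code: the objective is identical except that each point pays the max over wrong labels rather than the sum. Again a sketch of the primal objective only, with my own toy numbers:

```python
import numpy as np

def cs_loss(W, X, y, C=1.0):
    """(1/2) Σ_t ||w_t||^2 + C Σ_i max_{t≠y_i} max(0, 1 - (w_{y_i} - w_t)^T x_i)."""
    scores = X @ W.T                                   # n x L decision values
    margins = scores[np.arange(len(y)), y][:, None] - scores
    slack = np.maximum(0.0, 1.0 - margins)             # violation per (i, t)
    slack[np.arange(len(y)), y] = 0.0                  # t = y_i excluded
    return 0.5 * np.sum(W * W) + C * np.sum(np.max(slack, axis=1))

# one point of class 0; two competing labels with violations 2.0 and 1.5,
# of which only the larger one (2.0) is charged
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]])
X = np.array([[1.0, 0.0]])
y = np.array([0])
print(cs_loss(W, X, y))  # regularizer 0.625 + max(2.0, 1.5) = 2.625
```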
31 Crammer-Singer Formulation (dual). The dual problem of the Crammer-Singer formulation: min_{α} (1/2) Σ_{t=1}^L ||w_t(α)||^2 + Σ_{i=1}^n Σ_{t≠y_i} α_i^t, s.t. α_i^t ≤ C·I(t = y_i) for all t, and Σ_{t=1}^L α_i^t = 0 for i = 1, 2, ..., n, where α^1, ..., α^L ∈ R^n are the dual variables for each label, w_t(α) = Σ_i α_i^t x_i, and the sum constraint gives α_i^{y_i} = -Σ_{t≠y_i} α_i^t. Can be solved by dual (block) coordinate descent.
32 Comparisons (see "A Sequential Dual Method for Large Scale Multi-Class Linear SVMs", KDD 2008; tables not transcribed).
33 Scaling to a huge number of labels: Extreme Classification.
34 Challenges: large number of categories. Multi-label (or multi-class) classification with a large number of labels. Image classification: tens of thousands of labels or more. Recommending tags for articles: millions of labels (tags).
35 Challenges: large number of categories. Consider a problem with 1 million labels (L = 1,000,000). One versus all reduces it to 1 million binary problems. Training: 1 million binary classification problems; about 694 days if each binary problem takes 1 minute. Model size: 1 million models; about 1 TB if each model requires 1 MB. Prediction for one test point: 1 million binary predictions; 1000 seconds if each binary prediction takes 10^-3 seconds.
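The slide's arithmetic checks out:

```python
# One-versus-all cost for L = 1,000,000 labels, under the slide's unit costs.
L = 10**6
train_days = L * 60 / (24 * 3600)   # 1 minute per binary problem -> ~694 days
model_tb = L * 1 / 10**6            # 1 MB per model -> 1 TB total
predict_s = L * 1e-3                # 1 ms per binary prediction -> 1000 s
print(train_days, model_tb, predict_s)
```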
36 Extreme Classification Datasets See downloads/xc/xmlrepository.html
37 Several Approaches. Today we will discuss: label space reduction by compressed sensing (CS); feature space reduction (PCA); supervised latent feature models by matrix factorization. Other approaches: tree-based algorithms (random forests); neural networks?
38-40 Compressed Sensing. Suppose A ∈ R^{n×d}, w ∈ R^d, y ∈ R^n, and Aw = y. Given A and y, can we recover w? When n ≥ d: usually yes, because the number of constraints ≥ the number of variables (over-determined linear system). When n < d: no, because the number of constraints < the number of variables (high-dimensional regression, under-determined linear system, ...).
41 Compressed Sensing. However, if we know w is a sparse vector (with s nonzero elements) and A satisfies certain conditions, then we can recover w even in the high-dimensional setting.
42 Compressed Sensing. Conditions: w is s-sparse and A satisfies the Restricted Isometry Property (RIP). Example: all entries of A are generated i.i.d. from a Gaussian distribution. Then w can be recovered from O(s log d) samples, e.g., by Lasso (ℓ1-regularized linear regression) or Orthogonal Matching Pursuit (OMP). How is this related to multilabel classification?
43-44 Label Space Reduction by Compressed Sensing. Proposed in Hsu et al., "Multi-Label Prediction via Compressed Sensing", in NIPS. Main idea: reduce the label space from R^L to R^k, where k ≪ L, via z_i = M y_i with M ∈ R^{k×L}. If we can recover y_i from z_i and M, then we only need to learn a function mapping x_i to z_i: f(x_i) ≈ z_i. By compressed sensing, y_i can be recovered from z_i and M even when k = O(s log L), where s is the number of nonzeros in y_i; s is usually very small in practice.
45 Label Space Reduction by Compressed Sensing. Training: Step 1: construct M with i.i.d. Gaussian entries. Step 2: compute z_i = M y_i for all i = 1, 2, ..., n. Step 3: learn a function f such that f(x_i) ≈ z_i. Prediction for a test sample x: Step 1: compute z = f(x). Step 2: recover y ∈ R^L by solving a Lasso problem. Step 3: threshold y to give the predicted labels.
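The compress-then-recover step can be sketched in numpy. For brevity this uses orthogonal matching pursuit instead of the Lasso decoder on the slide (my substitution); the dimensions and the sparse label vector are illustrative:

```python
import numpy as np

def omp(M, z, s):
    """Recover an s-sparse y from z = M @ y by orthogonal matching pursuit."""
    support, residual = [], z.copy()
    for _ in range(s):
        j = int(np.argmax(np.abs(M.T @ residual)))   # most correlated column
        support.append(j)
        coef, *_ = np.linalg.lstsq(M[:, support], z, rcond=None)
        residual = z - M[:, support] @ coef          # re-fit on the support
    y_hat = np.zeros(M.shape[1])
    y_hat[support] = coef
    return y_hat

rng = np.random.default_rng(0)
L, k, s = 1000, 120, 3                               # k = O(s log L) << L
M = rng.standard_normal((k, L)) / np.sqrt(k)         # i.i.d. Gaussian M
y = np.zeros(L)
y[[3, 6, 7]] = 1.0                                   # label set Y_i = {3, 6, 7}
z = M @ y                                            # compressed labels, R^k
y_hat = omp(M, z, s)
print(np.flatnonzero(y_hat > 0.5))                   # recovered label set
```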
46 Label Space Reduction by Compressed Sensing. Reduces the label size from L to O(s log L). Drawbacks: (1) slow prediction (a Lasso problem must be solved for every prediction); (2) large error in practice.
47-48 Another way to understand this algorithm. X ∈ R^{n×d}: data matrix, each row a data point x_i. Y ∈ R^{n×L}: label matrix, each row a label vector y_i. M ∈ R^{L×k}: Gaussian random matrix; M^+: pseudo-inverse of M. Goal: find a matrix U such that XU ≈ YM, and hence XUM^+ ≈ YMM^+.
49 Feature space reduction. Another way to improve speed: reduce the feature space. Dimension reduction from the d-dimensional space to a k-dimensional space: x_i → N x_i = x̃_i, where N ∈ R^{k×d} and k ≪ d. Matrix form: X̃ = X N^T. Multilabel learning on the reduced feature space: learn V such that X̃ V ≈ Y. Time complexity: reduced by a factor of d/k. Several ways to choose N: PCA, random projection, ...
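A random-projection sketch of this step. The choice of ridge regression as the learner on the reduced features, and all the sizes, are my illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L, k = 500, 200, 50, 20
X = rng.standard_normal((n, d))
Y = (rng.random((n, L)) < 0.05).astype(float)     # sparse 0/1 label matrix

N = rng.standard_normal((k, d)) / np.sqrt(k)      # random projection, R^{k x d}
X_red = X @ N.T                                   # n x k instead of n x d

lam = 1e-2                                        # ridge fit of V: X_red V ≈ Y
V = np.linalg.solve(X_red.T @ X_red + lam * np.eye(k), X_red.T @ Y)
print(X_red.shape, V.shape)                       # (500, 20) (20, 50)
```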
50 Another way to understand this algorithm. X ∈ R^{n×d}: data matrix, each row a data point x_i. Y ∈ R^{n×L}: label matrix, each row y_i. N ∈ R^{k×d}: dimension-reduction matrix. Goal: find a matrix V such that X N^T V ≈ Y.
51 Supervised latent space model (by matrix factorization). Label space reduction finds U such that XU ≈ YM; feature space reduction finds V such that XN^T V ≈ Y. How to improve? Find the best U and V simultaneously: XUV^T ≈ Y. Proposed in Chen and Lin, "Feature-aware label space dimension reduction for multi-label classification", in NIPS; Xu et al., "Speedup Matrix Completion with Side Information: Application to Multi-Label Learning", in NIPS; Yu et al., "Large-scale Multi-label Learning with Missing Labels", in ICML 2014.
52 Low Rank Modeling for Multilabel Classification. Let X ∈ R^{n×d} be the data matrix and Y ∈ R^{n×L} the 0-1 label matrix. Find U ∈ R^{d×k} and V ∈ R^{L×k} such that Y ≈ X U V^T, by solving min_{U ∈ R^{d×k}, V ∈ R^{L×k}} ||Y - X U V^T||_F^2 + λ_1 ||U||_F^2 + λ_2 ||V||_F^2, where λ_1, λ_2 are regularization parameters. Solve by alternating minimization.
53 Low Rank Modeling for Multilabel Classification. Time complexity: O(nkL + ndk) per iteration overall, versus O(ndL) per iteration for the original (full) problem.
54 Extensions. General loss functions: min_{U ∈ R^{d×k}, V ∈ R^{L×k}} Σ_{i,j} dis(Y_{ij}, (XUV^T)_{ij}) + λ_1 ||U||_F^2 + λ_2 ||V||_F^2. Can handle problems with missing labels by summing only over the observed entries Ω: min_{U ∈ R^{d×k}, V ∈ R^{L×k}} Σ_{(i,j) ∈ Ω} dis(Y_{ij}, (XUV^T)_{ij}) + λ_1 ||U||_F^2 + λ_2 ||V||_F^2.
55 Model size and Prediction time. Model size: low-rank model O(dk + kL); classical multi-label (OVA) O(dL). Prediction time: low-rank model O(dk + kL) per sample; classical multi-label (OVA) O(dL) per sample.
56 More on prediction time: MIPS. Both the OVA and the low-rank models reduce prediction to a MIPS (Maximum Inner Product Search) problem: given a collection of n candidate vectors {h_j ∈ R^k : j = 1, ..., n} and a query point h, find argmax_j h_j^T h. OVA: L candidate vectors, each d-dimensional. Low-rank extreme classification: L candidate vectors, each k-dimensional. Matrix completion: n (number of items) vectors, each k-dimensional. (Will be discussed in detail later.)
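The reduction itself is one line of brute-force search; real systems replace this linear scan with sublinear search structures, but the sketch shows what MIPS computes (toy candidates are mine):

```python
import numpy as np

def mips(H, h):
    """H: n x k matrix of candidate rows h_j; return argmax_j h_j^T h."""
    return int(np.argmax(H @ h))

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.7]])
print(mips(H, np.array([1.0, 0.2])))   # -> 0 (inner products 1.0, 0.2, 0.84)
```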
57 Other extreme classification solvers. Random forest: Prabhu and Varma, "FastXML: A Fast, Accurate and Stable Tree-classifier for Extreme Multi-label Learning", in KDD; Jain et al., "Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking and Other Missing Label Applications", in KDD. Local embedding learning: Bhatia et al., "Sparse Local Embeddings for Extreme Multi-label Classification", in NIPS. Low-rank modeling: Weston et al., "Label Partitioning for Sublinear Ranking", in ICML; Yu et al., "Large-scale Multi-label Learning with Missing Labels", in ICML 2014. Crammer-Singer formulation: Yen et al., "PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification", in ICML 2016.
58 Coming up Questions?
More informationKaggle.
Administrivia Mini-project 2 due April 7, in class implement multi-class reductions, naive bayes, kernel perceptron, multi-class logistic regression and two layer neural networks training set: Project
More informationSupport Vector Machines and Kernel Methods
2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationOutline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22
Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems
More informationMachine Learning Basics
Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each
More informationSupport Vector Machines
Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two
More informationA Tutorial on Support Vector Machine
A Tutorial on School of Computing National University of Singapore Contents Theory on Using with Other s Contents Transforming Theory on Using with Other s What is a classifier? A function that maps instances
More informationLearning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31
Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking Dengyong Zhou zhou@tuebingen.mpg.de Dept. Schölkopf, Max Planck Institute for Biological Cybernetics, Germany Learning from
More informationA Least Squares Formulation for Canonical Correlation Analysis
A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation
More informationVariables. Cho-Jui Hsieh The University of Texas at Austin. ICML workshop on Covariance Selection Beijing, China June 26, 2014
for a Million Variables Cho-Jui Hsieh The University of Texas at Austin ICML workshop on Covariance Selection Beijing, China June 26, 2014 Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack,
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationCS 188: Artificial Intelligence. Outline
CS 188: Artificial Intelligence Lecture 21: Perceptrons Pieter Abbeel UC Berkeley Many slides adapted from Dan Klein. Outline Generative vs. Discriminative Binary Linear Classifiers Perceptron Multi-class
More informationBig & Quic: Sparse Inverse Covariance Estimation for a Million Variables
for a Million Variables Cho-Jui Hsieh The University of Texas at Austin NIPS Lake Tahoe, Nevada Dec 8, 2013 Joint work with M. Sustik, I. Dhillon, P. Ravikumar and R. Poldrack FMRI Brain Analysis Goal:
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationClassification and Pattern Recognition
Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations
More informationRandomized Algorithms
Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models
More informationOutline. Motivation. Mapping the input space to the feature space Calculating the dot product in the feature space
to The The A s s in to Fabio A. González Ph.D. Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 2, 2009 to The The A s s in 1 Motivation Outline 2 The Mapping the
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationLecture 7: Kernels for Classification and Regression
Lecture 7: Kernels for Classification and Regression CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 15, 2011 Outline Outline A linear regression problem Linear auto-regressive
More informationSupport Vector Machines and Kernel Methods
Support Vector Machines and Kernel Methods Chih-Jen Lin Department of Computer Science National Taiwan University Talk at International Workshop on Recent Trends in Learning, Computation, and Finance,
More informationFrom Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018
From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction
More informationMLCC 2018 Variable Selection and Sparsity. Lorenzo Rosasco UNIGE-MIT-IIT
MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT Outline Variable Selection Subset Selection Greedy Methods: (Orthogonal) Matching Pursuit Convex Relaxation: LASSO & Elastic Net
More informationLarge-Margin Thresholded Ensembles for Ordinal Regression
Large-Margin Thresholded Ensembles for Ordinal Regression Hsuan-Tien Lin and Ling Li Learning Systems Group, California Institute of Technology, U.S.A. Conf. on Algorithmic Learning Theory, October 9,
More informationMIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications. Class 08: Sparsity Based Regularization. Lorenzo Rosasco
MIT 9.520/6.860, Fall 2018 Statistical Learning Theory and Applications Class 08: Sparsity Based Regularization Lorenzo Rosasco Learning algorithms so far ERM + explicit l 2 penalty 1 min w R d n n l(y
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationSupport Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationLinear Classifiers IV
Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers IV Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic
More informationMidterm: CS 6375 Spring 2015 Solutions
Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an
More informationClassification and Support Vector Machine
Classification and Support Vector Machine Yiyong Feng and Daniel P. Palomar The Hong Kong University of Science and Technology (HKUST) ELEC 5470 - Convex Optimization Fall 2017-18, HKUST, Hong Kong Outline
More information