1 Metric Embedding of Task-Specific Similarity. Greg Shakhnarovich, Brown University. Joint work with Trevor Darrell (MIT). August 9, 2006.
2-4 Task-specific similarity. A toy example: [Figure, built up over three slides: the same pair of vectors can be judged similar by norm or by angle.]
5 The problem. Learn similarity from examples of what is similar [or not]. Binary similarity: for x, y in X, S(x, y) = +1 if x and y are similar, -1 if they are dissimilar. Two goals in mind: Similarity detection: judge whether two entities are similar. Similarity search: given a query entity and a database, find examples in the database that are similar to the query. Our approach: learn an embedding of the data into a space where similarity corresponds to a simple distance.
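To make the toy example above concrete, here is a minimal sketch (not from the talk) of two task-specific binary similarity functions S on 2-D vectors, one comparing norms and one comparing angles; the tolerance values are illustrative assumptions.

```python
import numpy as np

def similar_by_norm(x, y, tol=0.1):
    """S(x, y) = +1 if the vectors have (nearly) the same norm, else -1."""
    return +1 if abs(np.linalg.norm(x) - np.linalg.norm(y)) <= tol else -1

def similar_by_angle(x, y, tol=0.1):
    """S(x, y) = +1 if the vectors point in (nearly) the same direction, else -1."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return +1 if np.arccos(np.clip(cos, -1.0, 1.0)) <= tol else -1

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(similar_by_norm(x, y), similar_by_angle(x, y))   # +1, -1: depends on the task
```

The same pair of points is similar under one notion of similarity and dissimilar under the other, which is exactly why a fixed default distance cannot serve every task.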
6 Task-specific similarity. Articulated pose: [Figure.]
7 Task-specific similarity. Visual similarity of image patches: [Figure.]
8 Related prior work. Learning a parametrized distance metric [Xing et al; Roweis], such as Mahalanobis distances. Lots of work on low-dimensional graph embedding (MDS, LLE, ...), but it is unclear how to generalize without relying on distance in X. Embedding a known distance [Athitsos et al]: assumes the distance is known, uses the embedding to approximate or speed it up. DistBoost [Hertz et al]: learning a distance for a classification/clustering setup. Locality sensitive hashing: fast similarity search.
9-21 Locality Sensitive Hashing [Indyk et al]. Algorithm for finding an (ɛ, r)-neighbor of x_0 with high probability in sublinear time O(N^{1/(1+ɛ)}). Index the data by l random hash functions, and search only the union of the l buckets where the query falls. [Figure, built up over several slides: the query hashed into each of the l tables and the candidate buckets it retrieves.]
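As a concrete illustration of this indexing scheme, here is a minimal sketch, not the authors' implementation: it assumes a simple sign-of-random-projection hash family (the actual family used for L_p is discussed on the next slide), concatenates k bits into each key, and searches the union of l independent tables at query time. The parameters k and l are illustrative.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

class LSHIndex:
    def __init__(self, dim, k=8, l=10):
        # One set of k random hyperplanes per table; their signs form the hash key.
        self.planes = [rng.standard_normal((k, dim)) for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _key(self, t, x):
        return tuple((self.planes[t] @ x > 0).astype(int))

    def index(self, data):
        self.data = np.asarray(data)
        for i, x in enumerate(self.data):
            for t in range(len(self.tables)):
                self.tables[t][self._key(t, x)].append(i)

    def query(self, q):
        # Search only the union of the l buckets where the query falls.
        cand = set()
        for t in range(len(self.tables)):
            cand.update(self.tables[t].get(self._key(t, q), []))
        return sorted(cand)

index = LSHIndex(dim=16)
index.index(rng.standard_normal((1000, 16)))
print(len(index.query(rng.standard_normal(16))))   # number of candidates examined
```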
22 Locality sensitive hashing [Indyk et al]. A family H of functions is locality sensitive if P_{h~U[H]}(h(x_0) = h(x) | ||x_0 - x|| <= r) >= p_1 and P_{h~U[H]}(h(x_0) = h(x) | ||x_0 - x|| >= R) <= p_2. Uses the gap between the TP and FP rates; amplified by concatenating functions into a hash key. Projections on random lines are locality sensitive w.r.t. L_p norms, p <= 2 [Gionis et al, Datar et al].
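The random-line construction cited above can be sketched as follows, following the quantized-projection form of Datar et al for L_2: h(x) = floor((a.x + b) / w) with a Gaussian and b uniform on [0, w). The bucket width w here is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_hash(dim, w=4.0):
    a = rng.standard_normal(dim)          # random line (2-stable, so sensitive for L2)
    b = rng.uniform(0.0, w)               # random offset within one bucket width
    return lambda x: int(np.floor((a @ x + b) / w))

h = make_hash(dim=16)
x = rng.standard_normal(16)
print(h(x), h(x + 0.01), h(x + 100.0))    # nearby points usually collide, far ones rarely do
```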
23 How is this relevant? LSH is excellent if L_p is all we want. But L_p may not be a suitable proxy for S: we may waste lots of bits on irrelevant features, and miss pairs that are similar under S but not close w.r.t. L_p. If we know what S is, we may be able to analytically design an embedding of X into L_1 space [Thaper&Indyk, Grauman&Darrell]. We will instead learn LSH-style binary functions that fit S as conveyed by examples.
24 Our approach. Given: pairs of similar [and pairs of dissimilar] examples, based on the hidden binary similarity S. Two related tasks: Similarity judgment: S(x, y) = ? Similarity search: given a query x_0, find {x_i : S(x_i, x_0) = +1}. Our solution to both problems: a similarity-sensitive embedding H(x) = [α_1 h_1(x), ..., α_M h_M(x)]; we will learn h_m(x) in {0, 1} and α_m.
25 Desired embedding properties. Embedding H(x) = [α_1 h_1(x), ..., α_M h_M(x)]: Rely on the L_1 (= weighted Hamming) distance ||H(x_1) - H(x_2)|| = Σ_{m=1}^M α_m |h_m(x_1) - h_m(x_2)|. H is similarity sensitive: for some R, want high P_{x_1,x_2~p(x)}(||H(x_1) - H(x_2)|| <= R | S(x_1, x_2) = +1) and low P_{x_1,x_2~p(x)}(||H(x_1) - H(x_2)|| <= R | S(x_1, x_2) = -1). ||H(x) - H(y)|| is a proxy for S.
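A minimal sketch of the similarity judgment this embedding supports: compute the weighted Hamming (L_1) distance between embeddings and threshold it at R. The toy bit functions, weights, and radius below are assumptions standing in for the learned quantities.

```python
import numpy as np

def embed(x, h_list, alphas):
    """H(x) = [alpha_1 h_1(x), ..., alpha_M h_M(x)]."""
    return np.array([a * h(x) for h, a in zip(h_list, alphas)])

def weighted_hamming(x1, x2, h_list, alphas):
    return float(np.sum(np.abs(embed(x1, h_list, alphas) - embed(x2, h_list, alphas))))

def judge_similar(x1, x2, h_list, alphas, R):
    return weighted_hamming(x1, x2, h_list, alphas) <= R

# Toy example with two bit-valued functions on scalars.
h_list = [lambda x: int(x <= 0.0), lambda x: int(x <= 1.0)]
alphas = [0.7, 0.3]
print(judge_similar(0.2, 0.4, h_list, alphas, R=0.5))   # True: both bits agree
```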
26 Projection-based classifiers. For a projection f : X -> R, consider, for some T in R, h(x; f, T) = 1 if f(x) <= T, 0 if f(x) > T. This defines a simple similarity classifier of pairs: c(x, y; f, T) = +1 if h(x; f, T) = h(y; f, T), -1 otherwise. [Figure: points q, y, z fall on one side of T along f(x) and a, b on the other; c(q, y; f, T) = c(q, z; f, T) = +1, c(q, a; f, T) = c(q, b; f, T) = -1.]
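The thresholded projection h and the induced pair classifier c can be written directly; the particular projection f used below is an assumption for illustration.

```python
import numpy as np

def h(x, f, T):
    """Bit-valued thresholded projection h(x; f, T)."""
    return 1 if f(x) <= T else 0

def c(x, y, f, T):
    """Pair classifier: +1 if both points fall on the same side of T, else -1."""
    return +1 if h(x, f, T) == h(y, f, T) else -1

f = lambda x: float(np.dot(x, np.array([1.0, 0.0])))   # assumed projection direction
T = 0.0
print(c(np.array([-1.0, 2.0]), np.array([-0.5, -3.0]), f, T))  # +1: same side of T
print(c(np.array([-1.0, 2.0]), np.array([0.5, 1.0]), f, T))    # -1: pair is dissected by T
```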
27-30 Selecting the threshold. Algorithm for selecting the threshold based on similar/dissimilar pairs. For a moment, we focus on N positive pairs only. Sort the 2N values of f(x). We need to check at most 2N + 1 values of T, and count the number of pairs that are dissected (a bad event). [Figure, built up over several slides: projected pair values along f(x) and candidate threshold positions.]
31-32 Selecting the threshold. Also need to consider negative examples (dissimilar pairs), and estimate the gap: the true positive rate TP minus the false positive rate FP. [Figure: positive (POS) and negative (NEG) pair values along f(x).]
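A minimal sketch of the threshold-selection procedure from the last two slides: sort the projected values, scan candidate thresholds between consecutive values, and pick the one maximizing the TP - FP gap (positive pairs kept together minus negative pairs kept together). The synthetic pairs below are illustrative.

```python
import numpy as np

def select_threshold(f_pos_pairs, f_neg_pairs):
    """f_*_pairs: arrays of shape (N, 2) holding f(x) for both members of each pair."""
    vals = np.sort(np.concatenate([f_pos_pairs.ravel(), f_neg_pairs.ravel()]))
    # Only threshold placements between consecutive sorted values (plus the two
    # extremes) can change which pairs are dissected.
    candidates = np.concatenate([[vals[0] - 1.0],
                                 (vals[:-1] + vals[1:]) / 2.0,
                                 [vals[-1] + 1.0]])
    def same_side(pairs, T):
        # Fraction of pairs NOT dissected by T (both members on the same side).
        return np.mean((pairs <= T).sum(axis=1) != 1)
    gaps = [same_side(f_pos_pairs, T) - same_side(f_neg_pairs, T) for T in candidates]
    best = int(np.argmax(gaps))
    return candidates[best], gaps[best]

rng = np.random.default_rng(2)
pos = np.repeat(rng.normal(0.0, 1.0, (50, 1)), 2, axis=1) + rng.normal(0, 0.1, (50, 2))
neg = rng.normal(0.0, 1.0, (50, 2))
print(select_threshold(pos, neg))   # (best T, TP - FP gap at that T)
```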
33 Optimization of h(x; f, T). Objective (TP - FP gap): argmax_{f,T} [ (1/N_+) Σ_{i=1}^{N_+} c(x_i^+, x_i'^+) - (1/N_-) Σ_{j=1}^{N_-} c(x_j^-, x_j'^-) ]. Parametric projection families, e.g. f(x) = Σ_j θ_j x(d_j^1)^{p_1} x(d_j^2)^{p_2}. Soft versions of h and c make the gap differentiable w.r.t. θ, T: h(x; f, T) = 1 / (1 + exp(f(x) - T)), c(x, y) = 4 (h(x) - 1/2)(h(y) - 1/2).
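A minimal sketch of this soft relaxation. Two simplifications are assumptions: the projection family is linear (f(x) = θ.x) rather than the polynomial family on the slide, and gradients are taken numerically rather than analytically.

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_gap(params, X1, X2, labels, dim):
    """TP - FP gap with the soft h and c from the slide; f(x) = theta . x."""
    theta, T = params[:dim], params[dim]
    h = lambda X: 1.0 / (1.0 + np.exp(X @ theta - T))      # soft h(x; f, T)
    c = 4.0 * (h(X1) - 0.5) * (h(X2) - 0.5)                # soft c(x, y)
    return np.mean(c[labels == +1]) - np.mean(c[labels == -1])

def ascend(X1, X2, labels, dim, steps=200, lr=0.5, eps=1e-4):
    """Gradient ascent on the soft gap, with finite-difference gradients for brevity."""
    p = 0.1 * rng.standard_normal(dim + 1)
    for _ in range(steps):
        grad = np.array([(soft_gap(p + eps * e, X1, X2, labels, dim) -
                          soft_gap(p - eps * e, X1, X2, labels, dim)) / (2 * eps)
                         for e in np.eye(dim + 1)])
        p += lr * grad
    return p

# Toy pairs: similar pairs are small perturbations, dissimilar pairs are random.
X1 = rng.normal(size=(200, 2)); X2 = X1 + rng.normal(0, 0.1, (200, 2))
X2[100:] = rng.normal(size=(100, 2))
labels = np.r_[np.ones(100), -np.ones(100)]
print(soft_gap(ascend(X1, X2, labels, dim=2), X1, X2, labels, dim=2))
```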
34 Ensemble classifiers. A weighted embedding H(x) = [α_1 h_1(x), ..., α_M h_M(x)]. Each h_m(x) defines a classifier c_m. Together they form an ensemble classifier of similarity: C(x, y) = sgn(Σ_{m=1}^M α_m c_m(x, y)). We will construct the ensemble of bit-valued h's by a greedy algorithm based on AdaBoost (operating on the corresponding c's).
35 Boosting [Schapire et al]. Assumes access to a weak learner that can, at every iteration m, produce a classifier c_m better than chance. Main idea: maintain a distribution W_m(i) on the training examples, and update it according to the predictions of c_m: if c_m(x_i) is correct, W_{m+1}(i) goes down; otherwise, W_{m+1}(i) increases (steering c_{m+1} towards it). Our examples are pairs, and our weak classifiers are thresholded projections. To evaluate a threshold, we calculate the weight of separated pairs, rather than count them.
36 BoostPro. Given pairs (x_i^(1), x_i^(2)) labeled with l_i = S(x_i^(1), x_i^(2)):
1: Initialize weights W_1(i) to uniform.
2: for all m = 1, ..., M do
3: Find f, T = argmax_{f,T} r_m(f, T) by gradient-based search on r_m(f, T) = Σ_{i=1}^N W_m(i) l_i c_m(x_i^(1), x_i^(2)).
4: Set h_m <- h(x; f, T).
5: Set α_m (see the boosting papers).
6: Update weights: W_{m+1}(i) ∝ W_m(i) exp(-α_m l_i c_m(x_i^(1), x_i^(2))).
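A minimal sketch of this loop. Two deviations from the slide are assumptions: candidate (f, T) are found by scanning random linear projections and a grid of thresholds instead of gradient-based search, and α_m is set by the standard confidence-rated rule α_m = 0.5 log((1 + r_m)/(1 - r_m)), which the slide leaves to "the boosting papers".

```python
import numpy as np

rng = np.random.default_rng(4)

def boostpro(X1, X2, labels, M=10, n_candidates=50):
    """X1[i], X2[i] form the i-th pair; labels[i] = S(X1[i], X2[i]) in {+1, -1}."""
    N, dim = X1.shape
    W = np.ones(N) / N                                   # distribution over pairs
    ensemble = []                                        # list of (alpha_m, theta_m, T_m)
    for m in range(M):
        best = None
        for _ in range(n_candidates):                    # random linear projections (assumption)
            theta = rng.standard_normal(dim)
            p1, p2 = X1 @ theta, X2 @ theta
            for T in np.quantile(np.r_[p1, p2], np.linspace(0.05, 0.95, 19)):
                c = np.where((p1 <= T) == (p2 <= T), 1.0, -1.0)
                r = float(np.sum(W * labels * c))        # weighted objective r_m(f, T)
                if best is None or r > best[0]:
                    best = (r, theta, T, c)
        r, theta, T, c = best
        r = float(np.clip(r, -0.99, 0.99))
        alpha = 0.5 * np.log((1 + r) / (1 - r))          # confidence-rated weight (assumption)
        ensemble.append((alpha, theta, T))
        W *= np.exp(-alpha * labels * c)                 # boosting weight update
        W /= W.sum()
    return ensemble

def embed(x, ensemble):
    """H(x) = [alpha_1 h_1(x), ..., alpha_M h_M(x)] with h_m(x) = [theta_m . x <= T_m]."""
    return np.array([a * float(x @ th <= T) for a, th, T in ensemble])

# Toy usage: first half similar pairs, second half dissimilar.
X1 = rng.normal(size=(300, 5)); X2 = X1 + rng.normal(0, 0.1, (300, 5))
X2[150:] = rng.normal(size=(150, 5))
labels = np.r_[np.ones(150), -np.ones(150)]
H = boostpro(X1, X2, labels)
print(np.abs(embed(X1[0], H) - embed(X2[0], H)).sum(),       # similar pair: small distance
      np.abs(embed(X1[200], H) - embed(X2[200], H)).sum())   # dissimilar pair: larger distance
```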
37-41 Similarity is a rare event. In many domains, the vast majority of possible pairs are negative: people's poses, image patches, documents, ... A reasonable approximation of the sampling process: independently draw x, y from the data distribution; f(x), f(y) are then drawn from p(f(x)); label (x, y) negative. [Figure, built up over several slides: the density of f(x) split at threshold T; with π_T = Pr(f(x) <= T), a random pair is dissected by T with probability 2 π_T (1 - π_T).]
42 Semi-supervised setup. Given similar pairs and unlabeled x_1, ..., x_N. Estimate the TP rate as before. FP rate: estimate the CDF of f(x); let π_T = P̂(f(x) <= T). Then FP = π_T^2 + (1 - π_T)^2. Note: this means FP >= 1/2 [Ke et al].
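In code, this FP estimate is just the empirical CDF of f(x) at T over the unlabeled sample, plugged into π_T^2 + (1 - π_T)^2; a short sketch:

```python
import numpy as np

def false_positive_rate(f_values, T):
    """Probability that a random pair lands on the same side of T."""
    pi_T = float(np.mean(np.asarray(f_values) <= T))   # empirical P(f(x) <= T)
    return pi_T ** 2 + (1.0 - pi_T) ** 2

rng = np.random.default_rng(5)
vals = rng.standard_normal(10_000)
print(false_positive_rate(vals, 0.0))   # ~0.5, the smallest achievable value
print(false_positive_rate(vals, 2.0))   # close to 1: almost everything falls on one side
```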
43 BoostPro in a semi-supervised setup. Normal boosting assumes positive and negative examples. Intuition: each unlabeled example x_i represents all possible pairs (x_i, y); those are, w.h.p., negative under our assumption. Two distributions: W_m(j) on positive pairs, as before; S_m(i) on unlabeled (single) examples. Instead of c_m(x_i^(1), x_i^(2)), use the expectation E_y[c_m(x_i, y)] to calculate the objective and set the weights.
44 Semi-supervised boosting: details. Probability of misclassifying a negative pair (x_j, y): P_j = h(x_j; f, T) π_T + (1 - h(x_j; f, T))(1 - π_T). E_y[c(x_j, y)] = P_j (+1) + (1 - P_j)(-1) = 2 P_j - 1. Modified boosting objective: r = Σ_{i=1}^{N_p} W(i) c(x_i^(1), x_i^(2)) - Σ_{j=1}^N S(j) E_y[c(x_j, y)] = Σ_{i=1}^{N_p} W(i) c(x_i^(1), x_i^(2)) - Σ_{j=1}^N S(j)(2 P_j - 1).
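A minimal sketch of this modified objective: positive pairs contribute their pair classification as before, while each unlabeled example contributes its expected pair classification E_y[c(x_j, y)] = 2 P_j - 1. A hard (0/1) h is used for brevity, and the weights W and S are taken as given.

```python
import numpy as np

def semi_supervised_objective(W, f_pos1, f_pos2, S, f_unlab, T):
    """W: weights on positive pairs; S: weights on unlabeled examples; f_*: projected values."""
    h = lambda v: (np.asarray(v) <= T).astype(float)
    pi_T = float(np.mean(h(f_unlab)))                      # empirical P(f(x) <= T)
    c_pos = np.where(h(f_pos1) == h(f_pos2), 1.0, -1.0)    # pair classifier on positives
    P = h(f_unlab) * pi_T + (1.0 - h(f_unlab)) * (1.0 - pi_T)
    return float(np.sum(W * c_pos) - np.sum(S * (2.0 * P - 1.0)))

rng = np.random.default_rng(6)
f_pos1 = rng.normal(size=100); f_pos2 = f_pos1 + rng.normal(0, 0.05, 100)
f_unlab = rng.normal(size=500)
W = np.ones(100) / 100; S = np.ones(500) / 500
print(semi_supervised_objective(W, f_pos1, f_pos2, S, f_unlab, T=0.0))
```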
45 Results: Toy problems. M = 100, trained on 600 unlabeled points + 1,000 similar pairs. [Figure.]
46 Results: Toy problems. [Figure.]
47 Results: UCI data sets.
Data Set   L_1   PSH   BoostPro   M
MPG        ±     ±     ±          20
CPU        ±     ±     ±          48
Housing    ±     ±     ±          28
Abalone    ±     ±     ±          8
Census     ±     ±     ±
Test error on regression benchmark data from UCI/Delve: mean ± std. deviation of MSE using locally-weighted regression. The last column shows the value of M (dimension of the embedding H). Similarity is defined in terms of the target function values.
48 Results: pose retrieval. [Figure: input images and their 3 nearest neighbors in H.] H built with semi-supervised BoostPro on 200,000 examples; M = ...
49 Results: pose retrieval. Recall for k-NN retrieval: for each value of k, the plotted quantity is the fraction of true k-NN w.r.t. pose that fall within the k-NN w.r.t. an image-based similarity measure. Black dotted line: L_1 on EDH. Red dash-dot: PSH. Blue solid: BoostPro, M = ...
50 Visual similarity of image patches. Define two patches to be similar if they match under any rotation and a mild shift (±4 pixels). This should be covered by many reasonable similarity definitions.
51 Descriptor 1: sparse overcomplete code. Generative model of patches [Olshausen&Field]. Very unstable under transformations, hence L_1 is not a good proxy for similarity. With BoostPro, improvement in area under ROC from 0.56 to ...
52 Descriptor 2: SIFT. Scale-Invariant Feature Transform [Lowe]: a histogram of gradient magnitude and orientation within the region, normalized by orientation and computed at an appropriate scale. Discriminative (cannot generate a patch). Designed specifically to make two similar patches match.
53 Results. Area under the ROC curve: L_1 on SIFT vs. L_1 on H(SIFT).
54 Conclusions. It is beneficial to learn similarity directly for the task rather than rely on a default distance. Key property of our approach: similarity search is reduced to L_1 neighbor search (and thus can be done in sublinear time). Most useful for: regression and multi-label classification; when large amounts of labeled data are available; when L_p distance is not a good proxy for similarity.
55 Open questions. What kinds of similarity concepts can we learn? How do we explore the space of projections more efficiently? Factorization of similarity. Combining unsupervised and supervised similarity learning for image regions.
56 Questions?