Active Learning for Sparse Bayesian Multilabel Classification
1 Active Learning for Sparse Bayesian Multilabel Classification. Deepak Vasisht (MIT & IIT Delhi), Andreas Damianou (University of Sheffield), Manik Varma (MSR India), Ashish Kapoor (MSR Redmond)
2-6 Multilabel Classification. Given a set of datapoints, the goal is to annotate them with a set of labels. Each datapoint x_i ∈ R^d is a feature vector, where d is the dimension of the feature space. Example labels for an image: Iraq, Flowers, Human, Brick, Sea, Sun, Sky.
7-9 Training. WikiLSHTC has 325k labels. Good luck with that!!
10-11 Training Is Expensive. Training data can also be very expensive to obtain (e.g., genomic or chemical data), and getting each label incurs additional cost. We need to reduce the required training data.
12-17 Active Learning. Consider the N x L matrix of datapoints versus labels (e.g., Iraq, Flowers, Sun, Sky), with observed entries marked 1 or 0. Three questions arise: Which datapoints should I label? For a particular datapoint, which labels should I reveal? Can I choose datapoint-label pairs to annotate?
18 In this talk: an active learner for multilabel classification that answers all three questions, is computationally cheap, is non-myopic and near-optimal, incorporates label sparsity, and achieves higher accuracy than the state-of-the-art.
19-23 Classification Model*. Each datapoint x_i is mapped by a matrix W to a compressed space z_i^1, ..., z_i^k, which in turn generates the labels y_i^1, ..., y_i^L; sparsity is imposed through per-label precisions alpha_i^1, ..., alpha_i^L. *Kapoor et al., NIPS 2012
24-25 Classification Model: Potentials. The regression potential ties the compressed representation to the features, f_{x_i}(W, z_i) = exp(-||W^T x_i - z_i||^2), and the compression potential ties the labels to the compressed space through a compression matrix Phi, g(y_i, z_i) = exp(-||z_i - Phi y_i||^2 / (2 sigma^2)).
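As a toy sketch (not from the slides: the Gaussian compression matrix Phi, the noise level sigma^2, the sampled latent z, and all dimensions here are illustrative assumptions), the two potentials can be evaluated for a single datapoint as:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, L = 5, 3, 10                 # feature, compressed, and label dimensions (toy sizes)
sigma2 = 0.1                       # assumed noise variance in the compression potential

x = rng.normal(size=d)             # feature vector x_i
W = rng.normal(size=(d, k))        # regression matrix
Phi = rng.normal(size=(k, L))      # assumed random Gaussian compression matrix
y = np.zeros(L)
y[[1, 7]] = 1.0                    # a sparse label vector: only 2 of 10 labels active

# A latent compressed representation z_i roughly consistent with the labels:
z = Phi @ y + 0.05 * rng.normal(size=k)

log_f = -np.sum((W.T @ x - z) ** 2)                  # log f_{x_i}(W, z_i)
log_g = -np.sum((z - Phi @ y) ** 2) / (2 * sigma2)   # log g(y_i, z_i)
print(log_f, log_g)  # both <= 0; 0 would mean a perfect fit
```

Note that the compressed space has dimension k = 3, much smaller than the L = 10 labels, which is what makes the model cheap for large label sets.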
26-27 Classification Model: Priors. Each label gets a zero-mean Gaussian prior whose precision has a Gamma hyperprior: y_i^j ~ N(0, 1/alpha_i^j), with alpha_i^j ~ Gamma(alpha_i^j; a_0, b_0).
28 Sparsity Priors: a_0 = 10^-6, b_0 = 10^-6.
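The reason such a flat Gamma hyperprior promotes sparsity is standard (the same construction used in the relevance vector machine): integrating the precision out of the Gaussian prior gives a Student-t marginal, sharply peaked at zero with heavy tails. Sketching the well-known derivation:

```latex
p(y_i^j) = \int_0^\infty \mathcal{N}\!\left(y_i^j \mid 0, (\alpha_i^j)^{-1}\right)
           \,\Gamma(\alpha_i^j; a_0, b_0)\, d\alpha_i^j
         = \mathrm{St}\!\left(y_i^j \mid 0,\ a_0/b_0,\ 2a_0\right)
```

a Student-t with 2a_0 degrees of freedom; as a_0, b_0 approach values like 10^-6, this approaches the scale-invariant improper prior p(y) proportional to 1/|y|, which strongly favors label vectors with most entries near zero.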
29-30 Classification Model (summary). f_{x_i}(W, z_i) = exp(-||W^T x_i - z_i||^2); g(y_i, z_i) = exp(-||z_i - Phi y_i||^2 / (2 sigma^2)); y_i^j ~ N(0, 1/alpha_i^j); alpha_i^j ~ Gamma(alpha_i^j; a_0, b_0). Problem: exact inference in this model is intractable.
31-35 Inference: Variational Bayes. Approximate the posterior over W with a Gaussian, over each z_i with a Gaussian, over each y_i with a Gaussian, and over each precision alpha_i^j with a Gamma distribution.
36 Active Learning Criteria. Entropy is a measure of uncertainty: for a random variable X, H(X) = -sum_i P(x_i) log P(x_i). It picks points far apart from each other. For a Gaussian process, H = (1/2) log|Sigma| + const.
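As a quick numerical sketch (pure Python, not from the slides), the two entropy formulas above can be checked directly:

```python
import math

def discrete_entropy(p):
    """H(X) = -sum_i P(x_i) log P(x_i), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def gaussian_entropy(var):
    """Differential entropy of a 1-D Gaussian: (1/2) log(2*pi*e*var),
    i.e. (1/2) log(variance) plus a constant, matching the slide."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

print(discrete_entropy([0.5, 0.5]))                   # log 2 ~= 0.693: a fair coin
print(gaussian_entropy(4.0) - gaussian_entropy(1.0))  # = (1/2) log 4: the constant cancels
```

The constant in the Gaussian case cancels whenever entropies are compared, which is all the active-learning criteria below need.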
37 Active Learning Criteria. Mutual information measures the reduction in uncertainty over the unlabeled space: MI(A, B) = H(A) - H(A|B). It has been used successfully in past work for regression.
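For two jointly Gaussian variables with correlation rho, the definition has the closed form MI = -(1/2) log(1 - rho^2); a minimal check (not from the slides):

```python
import math

def gaussian_mi(rho):
    """MI(A, B) = H(A) - H(A|B) for jointly Gaussian A, B with unit variances
    and correlation rho. Conditioning on B scales A's variance by (1 - rho^2)."""
    h_a = 0.5 * math.log(2 * math.pi * math.e * 1.0)
    h_a_given_b = 0.5 * math.log(2 * math.pi * math.e * (1.0 - rho ** 2))
    return h_a - h_a_given_b

print(gaussian_mi(0.0))  # independent variables: MI = 0
print(gaussian_mi(0.9))  # strong correlation: large reduction in uncertainty
```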
38-39 Active Learning: Mutual Information. We have already modeled the distribution over labels Y as a Gaussian process. The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: A* = argmax_{A subset of U} H(Y_{U\A}) - H(Y_{U\A} | A). Problem: variance is not preserved across layers.
40-44 Idea: Collapsed Variational Bayes. Integrate W and z_i out of the model to get a Gaussian distribution over Y, and use variational Bayes only for the sparsity terms (the Gamma-distributed precisions alpha_i^j).
45-46 Active Learning: Mutual Information. With the collapsed model, Y is again Gaussian and the same objective applies: A* = argmax_{A subset of U} H(Y_{U\A}) - H(Y_{U\A} | A). Problem: computing the mutual information still needs exponential time.
47 Solution: Approximate Mutual Information. Approximate the final distribution over Y by a Gaussian and use that Gaussian to estimate the mutual information. Theorem 1: as a_0, b_0 -> 0, the approximate mutual information converges to the true MI.
48 Active Learning: Mutual Information. A* = argmax_{A subset of U} H(Y_{U\A}) - H(Y_{U\A} | A). Problem: this subset selection problem is NP-complete.
49 Solution: Use Submodularity. Under some weak conditions, the objective is submodular. Submodularity ensures that the greedy solution is within a constant factor of the optimal solution.
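The constant factor is the classic greedy guarantee for monotone submodular maximization (Nemhauser, Wolsey and Fisher): if F is monotone submodular with F of the empty set equal to 0 and A_greedy is the n-element greedy solution, then

```latex
F(A_{\text{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) \max_{|A| \le n} F(A) \;\approx\; 0.63 \, \max_{|A| \le n} F(A)
```

so greedy selection is near-optimal while costing only one marginal-gain evaluation per candidate per step.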
50 Algorithm. Input: feature vectors for a set of unlabeled instances U and a budget n. Iteratively add to the labeled set A the datapoint x that leads to the maximum increase in the (approximate) mutual information: x* = argmax_{x in U\A} MI(A union {x}) - MI(A).
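A minimal sketch of this greedy loop (NumPy; not the paper's implementation — the RBF kernel, the jitter, and the budget are illustrative assumptions, Gaussian entropies are log-determinants up to a shared constant that cancels, and a plain GP over points stands in for the collapsed Gaussian over labels):

```python
import numpy as np

def gaussian_entropy(K):
    # Entropy of a zero-mean Gaussian with covariance K, up to an additive
    # constant: H = (1/2) log|K| + const (the constant cancels in MI differences).
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * logdet

def mi_score(K, A):
    # MI(A) = H(Y_{U\A}) - H(Y_{U\A} | Y_A), computed from the joint covariance K.
    n = K.shape[0]
    rest = [i for i in range(n) if i not in A]
    if not A or not rest:
        return 0.0
    K_rr = K[np.ix_(rest, rest)]
    K_ra = K[np.ix_(rest, A)]
    K_aa = K[np.ix_(A, A)]
    cond = K_rr - K_ra @ np.linalg.solve(K_aa, K_ra.T)  # conditional covariance
    return gaussian_entropy(K_rr) - gaussian_entropy(cond)

def greedy_select(K, budget):
    # Greedily add the point with the largest marginal MI gain.
    A = []
    for _ in range(budget):
        pool = [i for i in range(K.shape[0]) if i not in A]
        gains = [mi_score(K, A + [x]) - mi_score(K, A) for x in pool]
        A.append(pool[int(np.argmax(gains))])
    return A

# Toy GP over 8 points on a line with an RBF kernel (illustrative choices).
pts = np.linspace(0.0, 1.0, 8)
K = np.exp(-(pts[:, None] - pts[None, :]) ** 2 / 0.02) + 1e-6 * np.eye(8)
chosen = greedy_select(K, 3)
print(chosen)  # three distinct indices, spread across the input range
```

Each step recomputes MI from scratch for clarity; a practical implementation would update the conditional variances incrementally, which is where the method's computational cheapness comes from.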
51 Performance Evaluation
52 Datasets (instance, feature, and label counts did not survive transcription): Yeast (biology), MSRC (image), Medical (text), Enron (text), Mediamill (video), RCV1 (text), Bookmarks (text), Delicious (text).
53 Setup. Unlabeled pool size: 4000 points; test size: 2000 points. For smaller datasets, the entire dataset was used as the unlabeled pool, testing on all unlabeled data. Initial seed size: 500 points.
54 Compared Algorithms. MIML: mutual information for multilabel classification (the proposed method). Uncert: uncertainty (entropy-based) sampling. Rand: random sampling. Li-Adaptive: SVM-based adaptive active learning (Li et al., IJCAI 2013).
55-59 Traditional Active Learning: which datapoints should I label? Plots (mean precision vs. number of labeled points) compare Rand, Uncert, Li-Adaptive, and MIML on Delicious (983 labels) and Yeast (14 labels).
60-63 Active Diagnosis: for a particular datapoint, which labels should I reveal? Plot (F-score vs. number of revealed labels, on RCV1) compares Rand, Uncert, and MIML.
64-67 Generalized Active Learning: can I choose datapoint-label pairs to annotate? Plot (F-score vs. number of annotated points, on RCV1) compares Rand, Uncert, and MIML.
68 Time Complexity (training time, MIML vs. Li-Adaptive; some digits lost in transcription):
Yeast (14 labels): MIML 3m 25s, Li-Adaptive 1m 54s
Mediamill: MIML m 29s, Li-Adaptive 54m 35s
RCV1: MIML m 45s, Li-Adaptive 37m 35s
Bookmarks: MIML m 58s, Li-Adaptive 3h 57m
Delicious (983 labels): MIML 1h 11m, Li-Adaptive 20h 15m
69 Related Work. SVM-based active learning: Li et al. [IJCAI 2013], Yang et al. [KDD 2009], Esuli et al. [ECIR 2009], Li et al. [ICIP 2004]. Mutual information: Krause et al. [UAI 2005], Krause et al. [JMLR 2008], Singh et al. [JAIR 2009].
70 Conclusion. Proposed mutual-information-based active learning for multilabel classification; used collapsed variational Bayes to infer variances; gave a theoretical analysis showing the mutual information approximation is near-optimal; and showed significant empirical improvements over the state-of-the-art.
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 14, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 14, 2017 1 Regularized Loss Minimization Assume h is defined by a vector w = (w 1,..., w d ) T R d (e.g., linear models) Regularization
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationMachine Learning. Linear Models. Fabio Vandin October 10, 2017
Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w
More informationExamples are not Enough, Learn to Criticize! Criticism for Interpretability
Examples are not Enough, Learn to Criticize! Criticism for Interpretability Been Kim, Rajiv Khanna, Oluwasanmi Koyejo Wittawat Jitkrittum Gatsby Machine Learning Journal Club 16 Jan 2017 1/20 Summary Examples
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationOn Bayesian Computation
On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More information02 Background Minimum background on probability. Random process
0 Background 0.03 Minimum background on probability Random processes Probability Conditional probability Bayes theorem Random variables Sampling and estimation Variance, covariance and correlation Probability
More informationFantope Regularization in Metric Learning
Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction
More informationCS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs
CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs Professor Erik Sudderth Brown University Computer Science October 6, 2016 Some figures and materials courtesy
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationTutorial on Gaussian Processes and the Gaussian Process Latent Variable Model
Tutorial on Gaussian Processes and the Gaussian Process Latent Variable Model (& discussion on the GPLVM tech. report by Prof. N. Lawrence, 06) Andreas Damianou Department of Neuro- and Computer Science,
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationGlobal vs. Multiscale Approaches
Harmonic Analysis on Graphs Global vs. Multiscale Approaches Weizmann Institute of Science, Rehovot, Israel July 2011 Joint work with Matan Gavish (WIS/Stanford), Ronald Coifman (Yale), ICML 10' Challenge:
More information