Active Learning for Sparse Bayesian Multilabel Classification


1 Active Learning for Sparse Bayesian Multilabel Classification. Deepak Vasisht, MIT & IIT Delhi; Andreas Damianou, University of Sheffield; Manik Varma, MSR India; Ashish Kapoor, MSR Redmond.

2-3 Multilabel Classification: Given a set of datapoints, the goal is to annotate them with a set of labels.

4 Multilabel Classification: Each datapoint is a feature vector $x_i \in \mathbb{R}^d$, where $d$ is the dimension of the feature space.

5-6 Multilabel Classification: [Figure: an example image annotated with the labels Iraq, Flowers, Human, Brick, Sea, Sun, Sky.]

7-8 Training

9 Training: WikiLSHTC has 325k labels. Good luck with that!!

10-11 Training Is Expensive: Training data can also be very expensive to obtain, e.g., genomic or chemical data, and getting each label incurs additional cost. We need to reduce the required training data.

12-14 Active Learning: [Figure: an N x L matrix with datapoints as rows and labels (Iraq, Flowers, Sun, Sky, ...) as columns; a few entries are observed as 1s and 0s, the rest are unknown.]

15 Active Learning: Which datapoints should I label?

16 Active Learning: For a particular datapoint, which labels should I reveal?

17 Active Learning: Can I choose datapoint-label pairs to annotate?

18 In this talk: An active learner for multilabel classification that answers all three questions above, is computationally cheap, is non-myopic and near-optimal, incorporates label sparsity, and achieves higher accuracy than the state of the art.

19 Classification

20-23 Classification Model (*Kapoor et al, NIPS 2012): [Figure: graphical model, built up in stages. The feature vector $x_i$ is mapped through a weight matrix $W$ to a compressed space $z_i^1, \ldots, z_i^k$, which is decoded into the labels $y_i^1, \ldots, y_i^L$; sparsity parameters $\alpha_i^1, \ldots, \alpha_i^L$ sit above the labels.]

24 Classification Model: Potentials. The regression potential ties the features to the compressed space: $f_{x_i}(W, z_i) = e^{-\|W^T x_i - z_i\|^2}$.

25 Classification Model: Potentials. The decoding potential ties the labels to the compressed space: $g(y_i, z_i) = e^{-\frac{\|\Phi y_i - z_i\|^2}{2\sigma^2}}$, where $\Phi$ is the compression matrix.
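As a quick sanity check, here is a minimal numerical sketch of the two potentials; the function names and the default $\sigma$ are ours, and $\Phi$ is treated as a fixed random compression matrix.

```python
import numpy as np

def f_potential(W, x, z):
    """Regression potential f_x(W, z) = exp(-||W^T x - z||^2)."""
    r = W.T @ x - z
    return np.exp(-r @ r)

def g_potential(Phi, y, z, sigma=1.0):
    """Decoding potential g(y, z) = exp(-||Phi y - z||^2 / (2 sigma^2))."""
    r = Phi @ y - z
    return np.exp(-(r @ r) / (2 * sigma ** 2))

# Toy dimensions: d features, k compressed dimensions, L labels.
d, k, L = 8, 3, 12
rng = np.random.default_rng(0)
W, Phi = rng.normal(size=(d, k)), rng.normal(size=(k, L))
x, y = rng.normal(size=d), rng.normal(size=L)
z = W.T @ x          # at z = W^T x the regression potential is maximal (= 1)
print(f_potential(W, x, z), g_potential(Phi, y, z))
```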

26 Classification Model: Priors. Each label score gets a zero-mean Gaussian prior: $y_i^j \sim \mathcal{N}(0, (\alpha_i^j)^{-1})$.

27 Classification Model: Priors. Each precision gets a Gamma prior: $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$.

28 Sparsity Priors: $a_0 = 10^{-6}$, $b_0 = 10^{-6}$, i.e., a broad Gamma prior whose induced marginal on $y_i^j$ is sharply peaked at zero.
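To see why these hyperparameters induce sparsity, marginalize out the precision (a standard Gaussian scale-mixture step, not spelled out on the slide):

$$p(y_i^j) = \int \mathcal{N}(y_i^j;\, 0,\, \alpha^{-1})\, \Gamma(\alpha;\, a_0, b_0)\, d\alpha \;\propto\; \Big(b_0 + \tfrac{(y_i^j)^2}{2}\Big)^{-(a_0 + 1/2)}$$

This is a Student-t density: as $a_0, b_0 \to 0$ it becomes sharply peaked at zero while keeping heavy tails, so most label scores are driven to zero and a few can stay large.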

29 Classification Model (summary): $f_{x_i}(W, z_i) = e^{-\|W^T x_i - z_i\|^2}$; $g(y_i, z_i) = e^{-\frac{\|\Phi y_i - z_i\|^2}{2\sigma^2}}$; $y_i^j \sim \mathcal{N}(0, (\alpha_i^j)^{-1})$; $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$.

30 Classification Model: Problem: Exact inference is intractable.

31-34 Inference: Variational Bayes. [Figure: the posterior is approximated factor by factor: Gaussian approximations over $W$ and the compressed variables $z_i$, a Gaussian approximation over the labels $y_i$, and a Gamma approximation over the sparsity parameters $\alpha_i$.]

35 Inference: Variational Bayes. [Figure: the same model, highlighting the approximate Gaussian over the labels.]

36 Active Learning Criteria. Entropy is a measure of uncertainty. For a random variable $X$, $H(X) = -\sum_i P(x_i) \log P(x_i)$. Entropy-based selection picks points far apart from each other. For a Gaussian process, $H = \frac{1}{2}\log(|\Sigma|) + \text{const}$.

37 Active Learning Criteria. Mutual information measures the reduction in uncertainty over the unlabeled space: $MI(A, B) = H(A) - H(A \mid B)$. It has been used successfully in past work on regression.
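Since everything here is (approximately) Gaussian, both criteria reduce to log-determinants. A small sketch of how they might be computed (function names are ours):

```python
import numpy as np

def gaussian_entropy(Sigma):
    """H(N(mu, Sigma)) = 0.5 * log det(2*pi*e*Sigma)."""
    k = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (k * np.log(2 * np.pi * np.e) + logdet)

def gaussian_mi(Sigma, A, B):
    """MI(A; B) = H(A) - H(A | B) for jointly Gaussian index sets A and B.
    The conditional covariance of A given B is the Schur complement."""
    S_AA = Sigma[np.ix_(A, A)]
    S_AB = Sigma[np.ix_(A, B)]
    S_BB = Sigma[np.ix_(B, B)]
    S_A_given_B = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)
    return gaussian_entropy(S_AA) - gaussian_entropy(S_A_given_B)
```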

38 Active Learning: Mutual Information. We have already modeled the distribution over the labels $Y$ as a Gaussian process. The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$.

39 Active Learning: Mutual Information. Same objective as above. Problem: Variance is not preserved across the layers of the model, so the layered variational approximation does not yield reliable variances over $Y$.

40 Idea: Collapsed Variational Bayes

41-42 Idea: Collapsed Variational Bayes. [Figure: the full model with its potentials and priors, as on slide 29.]

43-44 Idea: Collapsed Variational Bayes. Integrate out the intermediate variables to get a Gaussian distribution over $Y$, and use Variational Bayes for the sparsity terms.

45 Active Learning: Mutual Information. With the collapsed model, the distribution over the labels $Y$ is again Gaussian, and the objective is as before: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$.

46 Active Learning: Mutual Information. Problem: Computing the mutual information still needs exponential time.

47 Solution: Approximate Mutual Information. Approximate the final distribution over $Y$ by a Gaussian, and use the Gaussian to estimate the mutual information. Theorem 1: $\hat{MI} \to MI$ as $a_0, b_0 \to 0$.

48 Active Learning: Mutual Information. Problem: The subset selection problem is NP-complete.

49 Solution: Use Submodularity. Under some weak conditions, the objective is submodular. Submodularity ensures that the greedy solution is within a constant factor $(1 - 1/e)$ of the optimal solution.

50 Algorithm. Input: feature vectors for a set of unlabeled instances $U$ and a budget $n$. Iteratively add to the labeled set $A$ the datapoint $x$ that leads to the maximum increase in mutual information: $x^* \leftarrow \arg\max_{x \in U \setminus A} \hat{MI}(A \cup \{x\}) - \hat{MI}(A)$.
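Putting the pieces together, here is a minimal sketch of the greedy loop, assuming the collapsed Gaussian over $Y$ has already been computed and its covariance `Sigma` is available. It uses the variance-ratio form of the Gaussian MI gain from Krause et al.; all function names are ours.

```python
import numpy as np

def conditional_variance(Sigma, x, S):
    """Var(y_x | y_S) under a joint Gaussian with covariance Sigma."""
    if not S:
        return Sigma[x, x]
    K_SS = Sigma[np.ix_(S, S)]
    k_xS = Sigma[x, S]
    return Sigma[x, x] - k_xS @ np.linalg.solve(K_SS, k_xS)

def greedy_mi_selection(Sigma, budget):
    """Greedily add the point with maximum MI gain MI(A + x) - MI(A).
    For Gaussians this gain is monotone in the ratio of the variance of x
    conditioned on A to its variance conditioned on the rest of the pool."""
    n = Sigma.shape[0]
    A = []
    for _ in range(budget):
        candidates = [x for x in range(n) if x not in A]
        best_x, best_score = None, -np.inf
        for x in candidates:
            rest = [r for r in candidates if r != x]
            score = (conditional_variance(Sigma, x, A)
                     / conditional_variance(Sigma, x, rest))
            if score > best_score:
                best_x, best_score = x, score
        A.append(best_x)
    return A
```

Because the objective is submodular, a lazy-greedy variant that caches marginal gains would return the same selection at a fraction of the cost.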

51 Performance Evaluation

52 Datasets:
Dataset    Type
Yeast      Biology
MSRC       Image
Medical    Text
Enron      Text
Mediamill  Video
RCV1       Text
Bookmarks  Text
Delicious  Text
[The original table also listed instance, feature, and label counts for each dataset.]

53 Setup: Unlabeled pool size: 4000 points; test size: 2000 points. For smaller datasets, the entire dataset was used as the unlabeled pool, with testing on all unlabeled data. Initial seed size: 500 points.

54 Compared Algorithms. MIML: mutual information for multilabel classification (the proposed method). Uncert: uncertainty sampling (entropy-based). Rand: random sampling. Li-Adaptive: SVM-based adaptive active learning (Li et al, IJCAI 2013).

55 Traditional Active Learning: [Figure: the datapoint-label matrix.] Which datapoints should I label?

56-59 Traditional Active Learning: [Plots: mean precision vs. number of labeled points on Delicious (983 labels) and Yeast (14 labels), comparing Rand, Uncert, Li-Adaptive, and MIML.]

60 Active Learning: [Figure: the datapoint-label matrix.] For a particular datapoint, which labels should I reveal?

61 Active Diagnosis

62-63 Active Diagnosis: [Plots: F1 score vs. number of labels revealed on RCV1, comparing Rand, Uncert, and MIML.]

64 Generalized Active Learning: [Figure: the datapoint-label matrix.] Can I choose datapoint-label pairs to annotate?

65-67 Generalized Active Learning: [Plots: F1 score vs. number of annotated points on RCV1, comparing Rand, Uncert, and MIML.]

68 Time Complexity:
Dataset    Labels  MIML    Li-Adaptive
Yeast      14      3m 25s  1m 54s
Mediamill  101     m 29s   54m 35s
RCV1       103     m 45s   37m 35s
Bookmarks  208     m 58s   3h 57m
Delicious  983     1h 11m  20h 15m

69 Related Work. SVM-based active learning: Li et al [IJCAI 2013], Yang et al [KDD 2009], Esuli et al [ECIR 2009], Li et al [ICIP 2004]. Mutual information: Krause et al [UAI 2005], Krause et al [JMLR 2008], Singh et al [JAIR 2009].

70 Conclusion. Proposed mutual-information-based active learning for multilabel classification. Used Collapsed Variational Bayes to infer variances. Gave a theoretical analysis of the mutual information approximation and a near-optimality guarantee for the greedy selection. Showed significant empirical improvements over the state of the art.
