Stanford Statistics 311/Electrical Engineering 377
I. Bayes risk in classification problems

a. Recall definition (1.2.3) of the f-divergence between two distributions P and Q,

    D_f(P \| Q) := \int q(x) f( p(x) / q(x) ) dx,

where f : \R_+ \to \R is a convex function satisfying f(1) = 0. If f is not linear, then D_f(P \| Q) > 0 unless P = Q.

b. Focusing on the binary classification case, let us consider some example risks and see what connections they have to f-divergences. (Recall that we have X \in \mathcal{X} and a label Y \in \{-1, 1\} that we would like to classify.)

1. We require a few definitions to understand the performance of different classification strategies. In particular, we consider the difference between the risk attainable when we see a point to classify and when we do not.

2. The prior risk is the risk attainable without seeing x: for a fixed sign \alpha \in \R we have the definition

    R_{prior}(\alpha) := P(Y = 1) 1\{\alpha \le 0\} + P(Y = -1) 1\{\alpha \ge 0\},    (11.1.1)

and similarly the minimal prior risk, defined as

    R_{prior} := \inf_\alpha \{ P(Y = 1) 1\{\alpha \le 0\} + P(Y = -1) 1\{\alpha \ge 0\} \} = \min\{ P(Y = 1), P(Y = -1) \}.    (11.1.2)

3. We also have the prior \phi-risk, defined as

    R_{\phi,prior}(\alpha) := P(Y = 1) \phi(\alpha) + P(Y = -1) \phi(-\alpha),    (11.1.3)

and the minimal prior \phi-risk, defined as

    R_{\phi,prior} := \inf_\alpha \{ P(Y = 1) \phi(\alpha) + P(Y = -1) \phi(-\alpha) \}.    (11.1.4)

c. Examples of the 0-1 loss and its friends; we have X \in \mathcal{X} and Y \in \{-1, 1\}.

1. Example 11.11 (Binary classification with 0-1 loss): What is the Bayes risk of a binary classifier? Let

    p_1(x) := p(x | Y = 1), so that P(Y = 1 | X = x) p(x) = p_1(x) P(Y = 1),

be the density of X conditional on Y = 1, define p_{-1}(x) similarly, and assume that each class occurs with probability 1/2. Then the Bayes risk is

    R = \inf_\gamma \int [ 1\{\gamma(x) \le 0\} P(Y = 1 | X = x) + 1\{\gamma(x) \ge 0\} P(Y = -1 | X = x) ] p(x) dx
      = (1/2) \inf_\gamma \int [ 1\{\gamma(x) \le 0\} p_1(x) + 1\{\gamma(x) \ge 0\} p_{-1}(x) ] dx
      = (1/2) \int \min\{ p_1(x), p_{-1}(x) \} dx.

Similarly, we may compute the minimal prior risk, which is simply 1/2 by definition (11.1.2). Looking at the gap between the two, we obtain

    R_{prior} - R = (1/2) [ 1 - \int \min\{ p_1(x), p_{-1}(x) \} dx ] = (1/4) \int | p_1(x) - p_{-1}(x) | dx = (1/2) \| P_1 - P_{-1} \|_{TV}.

That is, the difference is half the variation distance between P_1 and P_{-1}, the distributions of X conditional on the label Y.
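The identity in Example 11.11 is easy to check numerically. The following sketch uses made-up three-point distributions (illustrative values only, not from the notes): with equal class priors, the gap between the minimal prior risk 1/2 and the Bayes risk equals half the variation distance.

```python
# Sanity check for Example 11.11 on a hypothetical 3-point space.
p1  = [0.5, 0.3, 0.2]   # p_1:    density of X given Y = +1 (illustrative values)
pm1 = [0.1, 0.3, 0.6]   # p_{-1}: density of X given Y = -1

# Bayes risk with equal priors: R = (1/2) * sum_x min{p_1(x), p_{-1}(x)}
bayes_risk = 0.5 * sum(min(a, b) for a, b in zip(p1, pm1))
prior_risk = 0.5  # minimal prior risk min{P(Y=1), P(Y=-1)} with equal priors

# Variation distance: ||P_1 - P_{-1}||_TV = (1/2) * sum_x |p_1(x) - p_{-1}(x)|
tv = 0.5 * sum(abs(a - b) for a, b in zip(p1, pm1))

gap = prior_risk - bayes_risk
assert abs(gap - 0.5 * tv) < 1e-12  # R_prior - R = (1/2) ||P_1 - P_{-1}||_TV
print(gap)
```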
2. Example 11.12 (Binary classification with hinge loss): We now repeat precisely the same calculations as in Example 11.11, but using as our loss the hinge loss \phi(\alpha) = [1 - \alpha]_+ (recall Example 11.2). In this case, the minimal \phi-risk is

    R_\phi = \inf_\gamma \int [ [1 - \gamma(x)]_+ P(Y = 1 | X = x) + [1 + \gamma(x)]_+ P(Y = -1 | X = x) ] p(x) dx
           = (1/2) \inf_\gamma \int [ [1 - \gamma(x)]_+ p_1(x) + [1 + \gamma(x)]_+ p_{-1}(x) ] dx
           = \int \min\{ p_1(x), p_{-1}(x) \} dx,

the inner infimum being attained at \gamma(x) \in \{-1, 1\}. We can similarly compute the minimal prior \phi-risk, which is R_{\phi,prior} = 1. Now, when we calculate the improvement available by observing X = x, we find that

    R_{\phi,prior} - R_\phi = 1 - \int \min\{ p_1(x), p_{-1}(x) \} dx = \| P_1 - P_{-1} \|_{TV},

which is suggestively similar to Example 11.11.

d. Is there anything more we can say about this?

II. Statistical information, f-divergences, and classification problems

a. Statistical information

1. Suppose we have a classification problem with data X \in \mathcal{X} and labels Y \in \{-1, 1\}. A natural notion of the information that X carries about Y is the gap

    R_{prior} - R,    (11.1.5)

that is, the gap between the prior risk and the risk attainable after viewing x \in \mathcal{X}.

2. [Didn't present this.] The true definition of statistical information: suppose class 1 has prior probability \pi and class -1 has prior probability 1 - \pi, and let P_1 and P_{-1} be the distributions of X \in \mathcal{X} given Y = 1 and Y = -1, respectively. The Bayes risk associated with the problem is then

    B_\pi(P_1, P_{-1}) := \inf_\gamma \int [ 1\{\gamma(x) \le 0\} p_1(x) \pi + 1\{\gamma(x) \ge 0\} p_{-1}(x) (1 - \pi) ] dx    (11.1.6)
                        = \int \min\{ p_1(x) \pi, p_{-1}(x) (1 - \pi) \} dx,

and similarly, the prior Bayes risk is

    \bar{B}_\pi := \inf_\alpha \{ 1\{\alpha \le 0\} \pi + 1\{\alpha \ge 0\} (1 - \pi) \} = \min\{ \pi, 1 - \pi \}.    (11.1.7)

Then the statistical information is

    \bar{B}_\pi - B_\pi(P_1, P_{-1}).    (11.1.8)

3. This measure was proposed by DeGroot [1] in the experimental design problem; there the goal is to infer the state of the world based on further experiments, and one wants to measure the quality of a measurement.

4. We saw that for the 0-1 loss, when a priori each class is equally likely,

    R_{prior} - R = (1/2) \| P_1 - P_{-1} \|_{TV},

and similarly for the hinge loss (Example 11.12) that

    R_{\phi,prior} - R_\phi = \| P_1 - P_{-1} \|_{TV}.
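The same toy computation extends to statistical information with a general prior \pi. The sketch below (again with illustrative, made-up distributions) evaluates definitions (11.1.6)-(11.1.8) and checks the two total-variation identities from item 4.

```python
# Statistical information (11.1.8) on a hypothetical 3-point space.
p1  = [0.5, 0.3, 0.2]   # density of X given Y = +1 (illustrative values)
pm1 = [0.1, 0.3, 0.6]   # density of X given Y = -1

def statistical_information(pi):
    # Bayes risk (11.1.6): B_pi = sum_x min{pi * p_1(x), (1 - pi) * p_{-1}(x)}
    bayes = sum(min(pi * a, (1 - pi) * b) for a, b in zip(p1, pm1))
    prior = min(pi, 1 - pi)  # prior Bayes risk (11.1.7)
    return prior - bayes     # statistical information (11.1.8)

tv = 0.5 * sum(abs(a - b) for a, b in zip(p1, pm1))

# For pi = 1/2 the 0-1 gap is half the variation distance, while the hinge-loss
# gap R_{phi,prior} - R_phi = 1 - sum_x min{p_1, p_{-1}} is the full distance.
info01 = statistical_information(0.5)
hinge_gap = 1.0 - sum(min(a, b) for a, b in zip(p1, pm1))
assert abs(info01 - 0.5 * tv) < 1e-12
assert abs(hinge_gap - tv) < 1e-12
assert statistical_information(0.3) >= 0  # the information is nonnegative
```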
5. Note that if P_1 \ne P_{-1}, then the statistical information is positive.

b. [Did present this.] Is there a more general story? Yes.

1. Consider any margin-based surrogate loss \phi, and look at the difference between

    B_{\phi,\pi}(P_1, P_{-1}) := \inf_\gamma \int [ \phi(\gamma(x)) p_1(x) \pi + \phi(-\gamma(x)) p_{-1}(x) (1 - \pi) ] dx
                              = \int \inf_\alpha [ \phi(\alpha) p_1(x) \pi + \phi(-\alpha) p_{-1}(x) (1 - \pi) ] dx

and the prior \phi-risk \bar{B}_{\phi,\pi} := \inf_\alpha \{ \phi(\alpha) \pi + \phi(-\alpha) (1 - \pi) \}.

2. Note that \bar{B}_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1}) is simply the gap in \phi-risk, R_{\phi,prior} - R_\phi, for the distribution with P(Y = 1) = \pi and

    P(Y = y | X = x) = p(x | Y = y) P(Y = y) / p(x) = p_y(x) \pi^{1\{y = 1\}} (1 - \pi)^{1\{y = -1\}} / ( \pi p_1(x) + (1 - \pi) p_{-1}(x) ).    (11.1.9)

c. We have the following theorem (see, for example, Liese and Vajda [2], or Reid and Williamson [4]). In the statement, \bar{l}_\phi(\eta) := \inf_\alpha \{ \phi(\alpha) \eta + \phi(-\alpha) (1 - \eta) \} denotes the minimal prior \phi-risk when the class-1 probability is \eta, so that \bar{B}_{\phi,\pi} = \bar{l}_\phi(\pi).

Theorem 11.13: Let P_1 and P_{-1} be arbitrary distributions on \mathcal{X}, and let \pi \in [0, 1] be the prior probability of class 1. Then there is a convex function f_{\pi,\phi} : \R_+ \to \R satisfying f_{\pi,\phi}(1) = 0 such that

    \bar{B}_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1}) = D_{f_{\pi,\phi}}(P_1 \| P_{-1}).

Moreover, this function f_{\pi,\phi} is

    f_{\pi,\phi}(t) = \sup_\alpha [ \bar{l}_\phi(\pi) - ( \pi \phi(\alpha) t + (1 - \pi) \phi(-\alpha) ) / ( \pi t + (1 - \pi) ) ] ( \pi t + (1 - \pi) ).

Proof: First, consider the integrated Bayes risk. Recalling the definition of the conditional probability \eta(x) := P(Y = 1 | X = x), we have

    \bar{B}_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1}) = \int [ \bar{l}_\phi(\pi) - \bar{l}_\phi(\eta(x)) ] p(x) dx
        = \int \sup_\alpha [ \bar{l}_\phi(\pi) - \phi(\alpha) P(Y = 1 | x) - \phi(-\alpha) P(Y = -1 | x) ] p(x) dx
        = \int \sup_\alpha [ \bar{l}_\phi(\pi) - \phi(\alpha) p_1(x) \pi / p(x) - \phi(-\alpha) p_{-1}(x) (1 - \pi) / p(x) ] p(x) dx,

where we have used Bayes' rule as in (11.1.9). Let us now divide all appearances of the density p_1 by p_{-1}; since p(x) = \pi p_1(x) + (1 - \pi) p_{-1}(x), this yields

    \bar{B}_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1})
        = \int \sup_\alpha [ \bar{l}_\phi(\pi) - ( \phi(\alpha) (p_1(x)/p_{-1}(x)) \pi + \phi(-\alpha) (1 - \pi) ) / ( (p_1(x)/p_{-1}(x)) \pi + (1 - \pi) ) ] ( (p_1(x)/p_{-1}(x)) \pi + (1 - \pi) ) p_{-1}(x) dx.

By inspection, this representation gives the result of the theorem if we can argue that the function f_\pi := f_{\pi,\phi} is convex, where we substitute t = p_1(x)/p_{-1}(x) in f_\pi(t). To see that f_\pi is convex, consider the intermediate function

    s_\pi(u) := \sup_\alpha \{ -\pi \phi(\alpha) u - (1 - \pi) \phi(-\alpha) \}.

This is the supremum of a family of functions affine in the variable u, so it is convex. (Equivalently, as we noted in the first exercise set, the perspective of a convex function g, defined by h(u, t) = t g(u/t) for t > 0, is jointly convex in u and t; here s_\pi(t) is the perspective of the convex function -\bar{l}_\phi evaluated at the point (\pi t, \pi t + (1 - \pi)).) Thus, expanding the supremum defining f_\pi, we may write

    f_\pi(t) = \bar{l}_\phi(\pi) ( \pi t + (1 - \pi) ) + s_\pi(t),

so f_\pi is convex as the sum of an affine function and a convex function. It is clear that f_\pi(1) = 0 by the definition of \bar{l}_\phi(\pi).

d. Take-home message: any loss function induces an associated f-divergence. (There is a complete converse, in that any f-divergence can be realized as the difference between the prior and posterior Bayes risks for some loss function; see, for example, Liese and Vajda [2] for results of this type.)

III. Quantization and other types of empirical minimization

a. Do these equivalences mean anything? What about the fact that the suboptimality function H_\phi was linear for the hinge loss?

b. Consider problems with quantization: we must jointly learn a classifier (prediction or discriminant function) and a quantizer q : \mathcal{X} \to \{1, ..., k\}, where k is fixed and we wish to find an optimal quantizer q \in Q, where Q is some family of quantizers. Recall the notation (1.2.1) of the quantized f-divergence,

    D_f(P_0 \| P_1 | q) = \sum_{i=1}^k P_1(q^{-1}(i)) f( P_0(q^{-1}(i)) / P_1(q^{-1}(i)) ) = \sum_{i=1}^k P_1(A_i) f( P_0(A_i) / P_1(A_i) ),

where the A_i = q^{-1}(i) are the quantization regions of \mathcal{X}.

c. Using Theorem 11.13, we can show how quantization and learning can be unified.

1. Quantized version of the risk: for q : \mathcal{X} \to \{1, ..., k\} and \gamma : [k] \to \R, define

    R_\phi(\gamma; q) := E[ \phi(Y \gamma(q(X))) ].

2. Rearranging and using the tower property of expectations, with \pi = P(Y = 1),

    R_\phi(\gamma; q) = E[ \phi(Y \gamma(q(X))) ]
        = \sum_{z=1}^k E[ \phi(Y \gamma(z)) | q(X) = z ] P(q(X) = z)
        = \sum_{z=1}^k [ \phi(\gamma(z)) P(Y = 1 | q(X) = z) + \phi(-\gamma(z)) P(Y = -1 | q(X) = z) ] P(q(X) = z)
        = \sum_{z=1}^k [ \phi(\gamma(z)) P(q(X) = z | Y = 1) P(Y = 1) + \phi(-\gamma(z)) P(q(X) = z | Y = -1) P(Y = -1) ]
        = \sum_{z=1}^k [ \phi(\gamma(z)) P_1(q(X) = z) \pi + \phi(-\gamma(z)) P_{-1}(q(X) = z) (1 - \pi) ].
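To make Theorem 11.13 concrete, one can evaluate the variational formula for f_{\pi,\phi} numerically. The sketch below uses hypothetical distributions; the suprema and infima over \alpha are taken over a finite grid, which is exact here because the hinge-loss objectives are piecewise linear with kinks at \pm 1, and \pm 1 are grid points. It recovers f_{\pi,\phi}(t) = |t - 1|/2 for the hinge loss at \pi = 1/2 (the generator of the variation distance) and confirms the identity \bar{B}_{\phi,\pi} - B_{\phi,\pi}(P_1, P_{-1}) = D_{f_{\pi,\phi}}(P_1 \| P_{-1}).

```python
# Theorem 11.13 for the hinge loss phi(a) = [1 - a]_+ at pi = 1/2 (illustrative).
import numpy as np

phi = lambda a: np.maximum(0.0, 1.0 - a)
pi = 0.5
# Grid over alpha; it includes the kinks at +/-1, where the hinge-loss optima
# occur, so the grid computations below are exact.
alphas = np.concatenate([np.linspace(-3.0, 3.0, 601), [-1.0, 1.0]])

def lbar(eta):
    # Minimal prior phi-risk: lbar_phi(eta) = inf_a {eta phi(a) + (1-eta) phi(-a)}
    return np.min(eta * phi(alphas) + (1 - eta) * phi(-alphas))

def f(t):
    # f_{pi,phi}(t) = sup_a [lbar(pi) (pi t + 1 - pi) - pi phi(a) t - (1-pi) phi(-a)]
    return np.max(lbar(pi) * (pi * t + 1 - pi)
                  - (pi * phi(alphas) * t + (1 - pi) * phi(-alphas)))

for t in [0.25, 0.5, 1.0, 2.0, 5.0]:
    assert abs(f(t) - abs(t - 1) / 2) < 1e-9  # hinge loss induces (1/2)|t - 1|

# Check the divergence identity on a hypothetical 3-point space:
p1  = [0.5, 0.3, 0.2]
pm1 = [0.1, 0.3, 0.6]
B = sum(np.min(pi * phi(alphas) * a + (1 - pi) * phi(-alphas) * b)
        for a, b in zip(p1, pm1))               # B_{phi,pi}(P_1, P_{-1})
D = sum(b * f(a / b) for a, b in zip(p1, pm1))  # D_{f_{pi,phi}}(P_1 || P_{-1})
assert abs((lbar(pi) - B) - D) < 1e-9           # theorem identity holds
```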
3. Let P^q denote the distribution on \{1, ..., k\} with probability mass function

    P^q(z) := P(q(X) = z) = P(q^{-1}(\{z\})),

and define the quantized Bayes \phi-risk

    R_\phi(q) := \inf_\gamma R_\phi(\gamma; q).

Then for the problem with P(Y = 1) = \pi, we have

    R_{\phi,prior} - R_\phi(q) = \bar{B}_{\phi,\pi} - B_{\phi,\pi}(P_1^q, P_{-1}^q) = D_{f_{\pi,\phi}}(P_1 \| P_{-1} | q).

d. A result unifying quantization and learning: we say that the loss functions \phi_1 and \phi_2 are universally equivalent if they induce the same f-divergence up to an affine correction, that is, if there are a constant c > 0 and a, b \in \R such that

    f_{\pi,\phi_1}(t) = c f_{\pi,\phi_2}(t) + a t + b for all t.

Theorem: Let \phi_1 and \phi_2 be universally equivalent margin-based surrogate loss functions. Then for any quantizers q_1 and q_2,

    R_{\phi_1}(q_1) \le R_{\phi_1}(q_2) if and only if R_{\phi_2}(q_1) \le R_{\phi_2}(q_2).

Proof: The proof follows straightforwardly from the representation in part c.3. If \phi_1 and \phi_2 are universally equivalent, then for any quantizer q we have

    R_{\phi_1,prior} - R_{\phi_1}(q) = D_{f_{\pi,\phi_1}}(P_1 \| P_{-1} | q) = c D_{f_{\pi,\phi_2}}(P_1 \| P_{-1} | q) + a + b
        = c [ R_{\phi_2,prior} - R_{\phi_2}(q) ] + a + b,

where the affine term contributes a + b because \sum_i P_{-1}(A_i) [ a P_1(A_i)/P_{-1}(A_i) + b ] = a + b. In particular,

    R_{\phi_1}(q_1) \le R_{\phi_1}(q_2)
    if and only if R_{\phi_1,prior} - R_{\phi_1}(q_1) \ge R_{\phi_1,prior} - R_{\phi_1}(q_2)
    if and only if D_{f_{\pi,\phi_1}}(P_1 \| P_{-1} | q_1) \ge D_{f_{\pi,\phi_1}}(P_1 \| P_{-1} | q_2)
    if and only if D_{f_{\pi,\phi_2}}(P_1 \| P_{-1} | q_1) \ge D_{f_{\pi,\phi_2}}(P_1 \| P_{-1} | q_2)    (as c > 0)
    if and only if R_{\phi_2,prior} - R_{\phi_2}(q_1) \ge R_{\phi_2,prior} - R_{\phi_2}(q_2).

Subtracting R_{\phi_2,prior} from both sides and negating gives the desired result.

e. Some comments:

1. We have something interesting: if we wish to learn a quantizer and a classifier jointly, this is possible using any loss equivalent to the true loss we care about.

2. Example: the hinge loss and the 0-1 loss are universally equivalent.

3. It turns out that the condition that the losses \phi_1 and \phi_2 be universally equivalent is (essentially) necessary and sufficient for two quantizers to induce the same ordering [3]; that is, universal equivalence is necessary and sufficient for the ordering conclusion of the theorem above.
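The ordering claim of the theorem can also be checked by brute force. In the sketch below (an illustrative construction, not from the notes), f1(t) = |t - 1|/2 plays the role of f_{\pi,\phi_1} and f2 = c f1 + a t + b is a universally equivalent generator; every pair of two-cell quantizers of a four-point space is then ordered identically by the two quantized divergences.

```python
# Universal equivalence and quantizer orderings (illustrative construction).
import itertools

f1 = lambda t: abs(t - 1) / 2          # generator of the variation distance
c, a, b = 3.0, 2.0, -2.0               # c > 0 and b = -a, so that f2(1) = 0
f2 = lambda t: c * f1(t) + a * t + b   # universally equivalent to f1

# Hypothetical distributions P_1, P_{-1} on a 4-point space.
p1  = [0.4, 0.3, 0.2, 0.1]
pm1 = [0.1, 0.2, 0.3, 0.4]

def quantized_div(f, cells):
    # D_f(P_1 || P_{-1} | q) = sum_i P_{-1}(A_i) f(P_1(A_i) / P_{-1}(A_i))
    return sum(sum(pm1[x] for x in A) *
               f(sum(p1[x] for x in A) / sum(pm1[x] for x in A)) for A in cells)

# All quantizers q with k = 2 cells, as partitions of {0, 1, 2, 3}.
quantizers = []
for r in (1, 2):
    for A in itertools.combinations(range(4), r):
        quantizers.append((A, tuple(x for x in range(4) if x not in A)))

for qa, qb in itertools.combinations(quantizers, 2):
    d1a, d1b = quantized_div(f1, qa), quantized_div(f1, qb)
    d2a, d2b = quantized_div(f2, qa), quantized_div(f2, qb)
    # The affine term a t + b integrates to a + b = 0 under the quantization,
    # so D_{f2}(. | q) = c D_{f1}(. | q) and the orderings must agree.
    assert abs(d2a - c * d1a) < 1e-12
    if abs(d1a - d1b) > 1e-9:
        assert (d1a > d1b) == (d2a > d2b)
```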
Bibliography

[1] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
[2] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.
[3] X. Nguyen, M. J. Wainwright, and M. I. Jordan. On surrogate loss functions and f-divergences. Annals of Statistics, 37(2):876-904, 2009.
[4] M. Reid and R. Williamson. Information, divergence, and risk for binary experiments. Journal of Machine Learning Research, 12:731-817, 2011.
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationThe Game of Twenty Questions with noisy answers. Applications to Fast face detection, micro-surgical tool tracking and electron microscopy
The Game of Twenty Questions with noisy answers. Applications to Fast face detection, micro-surgical tool tracking and electron microscopy Graduate Summer School: Computer Vision July 22 - August 9, 2013
More informationMachine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)
More informationAdaptive Sampling Under Low Noise Conditions 1
Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università
More informationCS Machine Learning Qualifying Exam
CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There
More information16 : Markov Chain Monte Carlo (MCMC)
10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions
More informationLecture 1a: Basic Concepts and Recaps
Lecture 1a: Basic Concepts and Recaps Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced
More informationMGMT 69000: Topics in High-dimensional Data Analysis Falll 2016
MGMT 69000: Topics in High-dimensional Data Analysis Falll 2016 Lecture 14: Information Theoretic Methods Lecturer: Jiaming Xu Scribe: Hilda Ibriga, Adarsh Barik, December 02, 2016 Outline f-divergence
More informationWriting proofs for MATH 51H Section 2: Set theory, proofs of existential statements, proofs of uniqueness statements, proof by cases
Writing proofs for MATH 51H Section 2: Set theory, proofs of existential statements, proofs of uniqueness statements, proof by cases September 22, 2018 Recall from last week that the purpose of a proof
More informationDS-GA 1003: Machine Learning and Computational Statistics Homework 6: Generalized Hinge Loss and Multiclass SVM
DS-GA 1003: Machine Learning and Computational Statistics Homework 6: Generalized Hinge Loss and Multiclass SVM Due: Monday, April 11, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationProbabilistic Machine Learning. Industrial AI Lab.
Probabilistic Machine Learning Industrial AI Lab. Probabilistic Linear Regression Outline Probabilistic Classification Probabilistic Clustering Probabilistic Dimension Reduction 2 Probabilistic Linear
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More information