A Study of Relative Efficiency and Robustness of Classification Methods


 Madlyn Owen
 1 years ago
 Views:
Transcription
1 A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics Seoul National University
2 Iris Data (red=setosa,green=versicolor,blue=virginica) Sepal.Length Sepal.Width Petal.Length Petal.Width
3 Figure: courtesy of Hastie, Tibshirani, & Friedman (2001) Handwritten digit recognition Cancer diagnosis with gene expression profiles Text categorization
4 Classification x = (x 1,..., x d ) R d y Y = {1,..., k} Learn a rule φ : R d Y from the training data {(x i, y i ), i = 1,..., n}, where (x i, y i ) are i.i.d. with P(X, Y). The 01 loss function: ρ(y,φ(x)) = I(y φ(x)) The Bayes decision rule φ B minimizing the error rate R(φ) = P(Y φ(x)) is φ B (x) = arg max P(Y = k X = x). k
5 Statistical Modeling: The Two Cultures One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. Breiman (2001) Modelbased methods in statistics: LDA, QDA, logistic regression, kernel density classification Algorithmic methods in machine learning: Support vector machine (SVM), boosting, decision trees, neural network Less is required in pattern recognition. Devroye, Györfi and Lugosi (1996) If you possess a restricted information for solving some problem, try to solve the problem directly and never solve a general problem as an intermediate step. Vapnik (1998)
6 Questions Is modeling necessary for classification? Does modeling lead to more accurate classification? How to quantify the relative efficiency? How do the two approaches compare?
7 Convex Risk Minimization In the binary case (k = 2), suppose that y = 1 or 1. Typically obtain a discriminant function f : R d R, which induces a classifier φ(x) = sign(f(x)), by minimizing the risk under a convex surrogate loss of the 01 loss ρ(y, f(x)) = I(yf(x) 0). Logistic regression: binomial deviance ( log likelihood) Support vector machine: hinge loss Boosting: exponential loss
8 Logistic Regression Agresti (2002), Categorical Data Analysis Model the conditional distribution p k (x) = P(Y = k X = x) directly. p 1 (x) log 1 p 1 (x) = f(x) Then Y X = x Bernoulli distribution with p 1 (x) = exp(f(x)) 1+exp(f(x)) and p 1(x) = 1 1+exp(f(x)). Maximizing the conditional likelihood of (y 1,...,y n ) given (x 1,..., x n ) (or minimizing the negative log likelihood) amounts to min f n log ( 1+exp( y i f(x i )) ). i=1
9 Support Vector Machine Vapnik (1996), The Nature of Statistical Learning Theory Find f with a large margin minimizing 1 n n (1 y i f(x i )) + +λ f 2. i=1 x margin=2/ β β t x + β 0 = 1 β t x + β 0 = 0 β t x + β 0 = x 1
10 Boosting Freund and Schapire (1997), A decisiontheoretic generalization of online learning and an application to boosting A metaalgorithm that combines the outputs of many weak classifiers to form a powerful committee Sequentially apply a weak learner to produce a sequence of classifiers f m (x), m = 1, 2,..., M and take a weighted majority vote for the final prediction. AdaBoost minimizes the exponential risk function with a stagewise gradient descent algorithm: min f n exp( y i f(x i )). i=1 Friedman, Hastie, and Tibshirani (2000), Additive Logistic Regression: A Statistical View of Boosting
11 Loss Functions ÄÓ Misclassification Exponential Binomial Deviance Squared Error Support Vector Ý Figure: courtesy of HTF (2001)
12 Classification Consistency The population minimizer f of ρ is defined as f with the minimum risk R(f) = Eρ(Y, f(x)). Negative loglikelihood (deviance) Hinge loss (SVM) f (x) = log Exponential loss (boosting) p 1 (x) 1 p 1 (x) f (x) = sign(p 1 (x) 1/2) f (x) = 1 2 log p 1 (x) 1 p 1 (x) sign(f ) yields the Bayes rule. Both modeling and algorithmic approaches are consistent. Lin (2000), Zhang (AOS 2004), Bartlett, Jordan, and McAuliffe (JASA 2006)
13 Outline Efron s comparison of LDA with logistic regression Efficiency of algorithmic approach (support vector machine and boosting) Simulation studies for comparison of efficiency and robustness Discussion
14 Normal Distribution Setting Two multivariate normal distributions in R d with mean vectors µ 1 and µ 2 and a common covariance matrix Σ π + = P(Y = 1) and π = P(Y = 1). For example, when π + = π, Fisher s LDA boundary is { } { Σ 1 (µ 1 µ 2 ) x 1 } 2 (µ 1 +µ 2 ) = 0. µ 1 µ 2
15 Canonical LDA setting Efron (JASA 1975), The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis X N(( /2)e 1, I) for Y = 1 with probability π + X N( ( /2)e 1, I) for Y = 1 with probability π where = {(µ 1 µ 2 ) Σ 1 (µ 1 µ 2 )} 1/2 Fisher s linear discriminant function is l(x) = log(π + /π )+ x 1. Let β 0 = log(π +/π ), (β 1,...,β d ) = e 1, and β = (β 0,...,β d ).
16 Excess Error For a linear discriminant method ˆl with coefficient vector ˆβ n, if n(ˆβ n β ) N(0,Σ β ), the expected increased error rate of ˆl, E(R(ˆl) R(φ B )) = π [ +φ(d 1 ) σ 00 2β 0 2 n σ 01+ β 2 0 where D 1 = /2+(1/ ) log(π + /π ). 2σ 11+σ σ dd ] +o( 1 n ), In particular, when π + = π, E(R(ˆl) R(φ B )) = φ( /2) ] [σ 00 +σ σ dd + o( 1 4 n n ).
17 Relative Efficiency Efron (1975) studied the Asymptotic Relative Efficiency (ARE) of logistic regression (LR) to normal discrimination (LDA) defined as E(R(ˆl lim LDA ) R(φ B )) n E(R(ˆl LR ) R(φ B )). Logistic regression is shown to be between one half and two thirds as effective as normal discrimination typically.
18 General Framework for Comparison Identify the limiting distribution of ˆβ n for other classification procedures (SVM, boosting, etc.) under the canonical LDA setting: 1 n ˆβ n = arg min ρ(y i, x i ;β) β n i=1 Need large sample theory for Mestimators. Find the excess error of each method and compute the efficiency relative to LDA.
19 Mestimator Asymptotics Pollard (ET 1991), Hjort and Pollard (1993), Geyer (AOS 1994), Knight and Fu (AOS 2000), Rocha, Wang and Yu (2009) Convexity of the loss ρ is the key. Let L(β) = Eρ(Y, X;β), β = arg min L(β), H(β) = 2 L(β), and G(β) = E β β Under some regularity conditions, ( )( ρ(y, X;β) ρ(y, X;β) β n(ˆβn β ) N(0, H(β ) 1 G(β )H(β ) 1 ) β ) in distribution.
20 Linear SVM Koo, Lee, Kim, and Park (JMLR 2008), A Bahadur Representation of the Linear Support Vector Machine With β = (β 0, w ), l(x;β) = w x +β { 0 } 1 n βλ,n = arg min (1 y i l(x i ;β)) + +λ w 2 β n i=1 Under the canonical LDA setting with π + = π, for λ = o(n 1/2 ), n ( βλ,n β SVM ) N(0,Σ β SVM ), where β SVM = 2 (2a + ) β LDA and a is a constant such that φ(a )/Φ(a ) = /2. If π + π, ŵ n w LDA but ˆβ 0 is inconsistent.
21 Relative Efficiency of SVM to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of the linear SVM to LDA is Eff = 2 2 (1+ 4 )φ(a ). R(φ B ) a SVM LR
22 Boosting 1 βn = arg min β n n exp( y i l(x i ;β)) i=1 Under the canonical LDA setting with π + = π, where β boost = 1 2 β LDA. n ( βn β boost ) N(0,Σ β boost ), In general, β n is a consistent estimator of (1/2)β LDA.
23 Relative Efficiency of Boosting to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of Boosting to LDA is Eff = 1+ 2 /4 exp( 2 /4). R(φ B ) Boosting SVM LR
24 Smooth SVM Lee and Mangasarian (2001), SSVM: A Smooth Support Vector Machine { } 1 n βλ,n = arg min (1 y i l(x i ;β)) 2 + β n +λ w 2 i=1 Under the canonical LDA setting with π + = π, for λ = o(n 1/2 ), n ( βλ,n β SSVM ) N(0,Σ β SSVM ), where β SSVM = 2 (2a + ) β LDA and a is a constant such that {a Φ(a )+φ(a )} = 2Φ(a ).
25 Relative Efficiency of SSVM to LDA Under the canonical LDA setting with π + = π = 0.5, the ARE of the Smooth SVM to LDA is Eff = (4+ 2 )Φ(a ) (2a + ). R(φ B ) a SSVM SVM LR
26 Possible Explanation for Increased Efficiency Hastie, Tibshirani, and Friedman (2001), Elements of Statistical Learning There is a close connection between Fisher s LDA and regression approach to classification with class indicators: n min (y i β 0 w x i ) 2 = i=1 n (1 y i (β 0 + w x i )) 2 i=1 The least squares coefficient is identical up to a scalar multiple to the LDA coefficient: ŵ ˆΣ 1 (ˆµ 1 ˆµ 2 )
27 FiniteSample Excess Error Excess Error LDA (1.2578) LR (1.5021) SVM (1.7028) SSVM (1.4674) Sample Size (inverse scale) Figure: = 2, d = 5, R(φ B ) = , and π + = π. The results are based on 1000 replicates.
28 What If Model is Misspecified? All models are wrong, but some are useful. George Box Compare methods under A mixture of two Gaussian distributions Mislabeling in LDA setting Quadratic discriminant analysis setting
29 A Mixture of Two Gaussian Distributions µ f2 W µ f1 µ g2 B µ g1 Figure: W and B indicate the mean difference between two Gaussian components within each class and the mean difference between two classes.
30 As W Varies Error Rate Bayes Rule LDA LR SVM SSVM Figure: B = 2, d = 5, π + = π, π 1 = π 2, and n = 100 w
31 As B Varies Increased Error Rate LDA LR SVM SSVM Figure: W = 1, d = 5, π + = π, π 1 = π 2, and n = 100 B
32 As Dimension d Varies Error Rate Bayes Rule LDA LR SVM SSVM Dimension Figure: W = 1, B = 2, π + = π, π 1 = π 2, and n = 100
33 Mislabeling in LDA excess error SVM Huberized SVM Smooth SVM perturbation fraction Figure: Mean excess errors of SVM and its variants from 400 replicates as the mislabeling proportion varies. = 2.7, d = 5, R(φ B ) = , π + = π, and n = 100.
34 QDA Setting C= 1 C= µ µ µ µ C= 3 C= µ 2 µ µ 2 µ Figure: X Y = 1 N(µ 1,Σ) and X Y = 1 N(µ 2, CΣ)
35 Decomposition of Error For a rule φ F, R(φ) R(φ B ) = R(φ) R(φ F ) + R(φ }{{} F ) R(φ B ) }{{} estimation error approximation error where φ F = arg min φ F R(φ). When a method M is used to choose φ from F, R(φ) R(φ F ) = R(φ) R(φ M ) }{{} + R(φ M ) R(φ F ) }{{} Mspecific est.error Mspecific approx.error where φ M is the methodspecific limiting rule within F.
36 Approximation Error Error Rate Bayes Error Approximation Error Figure: QDA setting with = 2, Σ = I, d = 10, and π + = π C
37 MethodSpecific Approximation Error Approximation Error LDA SVM SSVM Boosting Figure: Methodspecific approximation error of linear classifiers in the QDA setting C
38 Extensions For high dimensional data, study double asymptotics where d also grows with n. Compare methods in a regularization framework. Investigate consistency and relative efficiency under other models. Also consider potential model misspecification and compare methods in terms of robustness.
39 Concluding Remarks Compared modeling approach with algorithmic approach in the efficiency of reducing error rates. Under the normal setting, modeling leads to more efficient use of data. Linear SVM is shown to be between 40% and 67% as effective as LDA when the Bayes error rate is between 4% and 10%. A loss function plays an important role in determining the efficiency of the corresponding procedure. Squared hinge loss could yield more effective procedure than logistic regression. There is a tradeoff between efficiency and robustness. The theoretical comparisons can be extended in many directions.
40 References P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101: , B. Efron. The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70(352): , Dec N. L. Hjort and D. Pollard. Asymptotics for minimisers of convex processes. Statistical Research Report, May JaYong Koo, Yoonkyung Lee, Yuwon Kim, and Changyi Park. A Bahadur representation of the linear Support Vector Machine. Journal of Machine Learning Research, 9: , 2008.
41 Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56 85, Y.J. Lee and O.L. Mangasarian. SSVM: A smooth support vector machine. Computational Optimization and Applications, 20:5 22, Y. Lin. A note on marginbased loss functions in classification. Statististics and Probability Letters, 68:73 82, D. Pollard. Asymptotics for least absolute deviation regression estimators. Econometric Theory, 7: , G. Rocha, X. Wang, and B. Yu. Asymptotic distribution and sparsistency for l 1 penalized parametric Mestimators, with applications to linear SVM and logistic regression. arxiv, v1:1 55, Aug 2009.
Does Modeling Lead to More Accurate Classification?
Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang
More informationA Bahadur Representation of the Linear Support Vector Machine
A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support
More informationDoes Modeling Lead to More Accurate Classification?: A Study of Relative Efficiency
Does Modeling Lead to More Accurate Classification?: A Study of Relative Efficiency Yoonkyung Lee, The Ohio State University Rui Wang, The Ohio State University Technical Report No. 869 July, 01 Department
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationRegularization Paths
December 2005 Trevor Hastie, Stanford Statistics 1 Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, Saharon Rosset, Ji Zhu, Hui Zhou, Rob Tibshirani and
More informationChap 2. Linear Classifiers (FTH, ) Yongdai Kim Seoul National University
Chap 2. Linear Classifiers (FTH, 4.14.4) Yongdai Kim Seoul National University Linear methods for classification 1. Linear classifiers For simplicity, we only consider twoclass classification problems
More informationRegularization Paths. Theme
June 00 Trevor Hastie, Stanford Statistics June 00 Trevor Hastie, Stanford Statistics Theme Regularization Paths Trevor Hastie Stanford University drawing on collaborations with Brad Efron, MeeYoung Park,
More informationClassification CE717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationMachine Learning. RegressionBased Classification & Gaussian Discriminant Analysis. Manfred Huber
Machine Learning RegressionBased Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationAdaBoost and other Large Margin Classifiers: Convexity in Classification
AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationMargin Maximizing Loss Functions
Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing
More informationClassification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC)
Classification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC) Eunsik Park 1 and Yc Ivan Chang 2 1 Chonnam National University, Gwangju, Korea 2 Academia Sinica, Taipei,
More informationBoosting. CAP5610: Machine Learning Instructor: GuoJun Qi
Boosting CAP5610: Machine Learning Instructor: GuoJun Qi Weak classifiers Weak classifiers Decision stump one layer decision tree Naive Bayes A classifier without feature correlations Linear classifier
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationBoosting Methods: Why They Can Be Useful for HighDimensional Data
New URL: http://www.rproject.org/conferences/dsc2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609395X Kurt Hornik,
More informationLecture 5: LDA and Logistic Regression
Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppäaho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More information10701/ Machine Learning  Midterm Exam, Fall 2010
10701/15781 Machine Learning  Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: Email address: 2. There should be 15 numbered pages in this exam
More informationA Bahadur Representation of the Linear Support Vector Machine
Journal of Machine Learning Research 9 2008 13431368 Submitted 12/06; Revised 2/08; Published 7/08 A Bahadur Representation of the Linear Support Vector Machine JaYong Koo Department of Statistics Korea
More informationIndirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection
More informationAdaBoost. Lecturer: Authors: Center for Machine Perception Czech Technical University, Prague
AdaBoost Lecturer: Jan Šochman Authors: Jan Šochman, Jiří Matas Center for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Motivation Presentation 2/17 AdaBoost with trees
More informationLinear Regression and Discrimination
Linear Regression and Discrimination Kernelbased Learning Methods Christian Igel Institut für Neuroinformatik RuhrUniversität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationSupport Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM
1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM JeanPhilippe Vert Bioinformatics Center, Kyoto University, Japan JeanPhilippe.Vert@mines.org Human Genome Center, University
More informationChapter 6. Ensemble Methods
Chapter 6. Ensemble Methods Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Introduction
More informationStructured Statistical Learning with Support Vector Machine for Feature Selection and Prediction
Structured Statistical Learning with Support Vector Machine for Feature Selection and Prediction Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohiostate.edu/ yklee Predictive
More informationLEAST ANGLE REGRESSION 469
LEAST ANGLE REGRESSION 469 Specifically for the Lasso, one alternative strategy for logistic regression is to use a quadratic approximation for the loglikelihood. Consider the Bayesian version of Lasso
More informationThe Bayes classifier
The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal
More informationGeneralized Boosted Models: A guide to the gbm package
Generalized Boosted Models: A guide to the gbm package Greg Ridgeway April 15, 2006 Boosting takes on various forms with different programs using different loss functions, different base models, and different
More informationThe Margin Vector, Admissible Loss and Multiclass Marginbased Classifiers
The Margin Vector, Admissible Loss and Multiclass Marginbased Classifiers Hui Zou University of Minnesota Ji Zhu University of Michigan Trevor Hastie Stanford University Abstract We propose a new framework
More informationClassification. Chapter Introduction. 6.2 The Bayes classifier
Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode
More informationStatistical Properties of Large Margin Classifiers
Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppäaho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationLecture 5: Classification
Lecture 5: Classification Advanced Applied Multivariate Analysis STAT 2221, Spring 2015 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department of Mathematical Sciences Binghamton
More informationRandom projection ensemble classification
Random projection ensemble classification Timothy I. Cannings Statistics for Big Data Workshop, Brunel Joint work with Richard Samworth Introduction to classification Observe data from two classes, pairs
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationIntroduction to Machine Learning Spring 2018 Note 18
CS 189 Introduction to Machine Learning Spring 2018 Note 18 1 Gaussian Discriminant Analysis Recall the idea of generative models: we classify an arbitrary datapoint x with the class label that maximizes
More informationA short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie
A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial SloanKettering Cancer Center http://cbio.mskcc.org/leslielab
More informationLecture 18: Multiclass Support Vector Machines
Fall, 2017 Outlines Overview of Multiclass Learning Traditional Methods for Multiclass Problems Onevsrest approaches Pairwise approaches Recent development for Multiclass Problems Simultaneous Classification
More informationSupervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing
Supervised Learning Unsupervised learning: To extract structure and postulate hypotheses about data generating process from observations x 1,...,x n. Visualize, summarize and compress data. We have seen
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationMachine Learning Support Vector Machines. Prof. Matteo Matteucci
Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way
More informationLogistic Regression Trained with Different Loss Functions. Discussion
Logistic Regression Trained with Different Loss Functions Discussion CS640 Notations We restrict our discussions to the binary case. g(z) = g (z) = g(z) z h w (x) = g(wx) = + e z = g(z)( g(z)) + e wx =
More informationIntroduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research
Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Boosting Mehryar Mohri  Introduction to Machine Learning page 2 Boosting Ideas Main idea:
More informationSTKIN4300 Statistical Learning Methods in Data Science
Outline of the lecture STKIN4300 Statistical Learning Methods in Data Science Riccardo De Bin debin@math.uio.no AdaBoost Introduction algorithm Statistical Boosting Boosting as a forward stagewise additive
More informationWhy does boosting work from a statistical view
Why does boosting work from a statistical view Jialin Yi Applied Mathematics and Computational Science University of Pennsylvania Philadelphia, PA 939 jialinyi@sas.upenn.edu Abstract We review boosting
More informationA ROBUST ENSEMBLE LEARNING USING ZEROONE LOSS FUNCTION
Journal of the Operations Research Society of Japan 008, Vol. 51, No. 1, 95110 A ROBUST ENSEMBLE LEARNING USING ZEROONE LOSS FUNCTION Natsuki Sano Hideo Suzuki Masato Koda Central Research Institute
More informationCS 1955: Machine Learning Problem Set 1
CS 955: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationIntroduction to Machine Learning
Introduction to Machine Learning Thomas G. Dietterich tgd@eecs.oregonstate.edu 1 Outline What is Machine Learning? Introduction to Supervised Learning: Linear Methods Overfitting, Regularization, and the
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationGaussian discriminant analysis Naive Bayes
DM825 Introduction to Machine Learning Lecture 7 Gaussian discriminant analysis Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. is 2. Multivariate
More informationEnsemble Methods. Charles Sutton Data Mining and Exploration Spring Friday, 27 January 12
Ensemble Methods Charles Sutton Data Mining and Exploration Spring 2012 Bias and Variance Consider a regression problem Y = f(x)+ N(0, 2 ) With an estimate regression function ˆf, e.g., ˆf(x) =w > x Suppose
More informationThe classifier. Theorem. where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know
The Bayes classifier Theorem The classifier satisfies where the min is over all possible classifiers. To calculate the Bayes classifier/bayes risk, we need to know Alternatively, since the maximum it is
More informationThe classifier. Linear discriminant analysis (LDA) Example. Challenges for LDA
The Bayes classifier Linear discriminant analysis (LDA) Theorem The classifier satisfies In linear discriminant analysis (LDA), we make the (strong) assumption that where the min is over all possible classifiers.
More informationBoosting. March 30, 2009
Boosting Peter Bühlmann buhlmann@stat.math.ethz.ch Seminar für Statistik ETH Zürich Zürich, CH8092, Switzerland Bin Yu binyu@stat.berkeley.edu Department of Statistics University of California Berkeley,
More informationLecture 4 Discriminant Analysis, knearest Neighbors
Lecture 4 Discriminant Analysis, knearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationA Simple Algorithm for Learning Stable Machines
A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationMultivariate statistical methods and data mining in particle physics
Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationISyE 6416: Computational Statistics Spring Lecture 5: Discriminant analysis and classification
ISyE 6416: Computational Statistics Spring 2017 Lecture 5: Discriminant analysis and classification Prof. Yao Xie H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationProbabilistic classification CE717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationThe exam is closed book, closed notes except your onepage (two sides) or twopage (one side) crib sheet.
CS 189 Spring 013 Introduction to Machine Learning Final You have 3 hours for the exam. The exam is closed book, closed notes except your onepage (two sides) or twopage (one side) crib sheet. Please
More informationMachine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationLearning with multiple models. Boosting.
CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models
More informationKernelized Perceptron Support Vector Machines
Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationAdvanced Machine Learning
Advanced Machine Learning Deep Boosting MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Model selection. Deep boosting. theory. algorithm. experiments. page 2 Model Selection Problem:
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression  Maximum likelihood estimation (MLE)  Maximum a posteriori (MAP) estimation Biasvariance tradeoff Linear
More informationA Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data
A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohiostate.edu/ yklee May 13, 2005
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Nonparametric
More informationA Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University
A Gentle Introduction to Gradient Boosting Cheng Li chengli@ccs.neu.edu College of Computer and Information Science Northeastern University Gradient Boosting a powerful machine learning algorithm it can
More informationCMUQ Lecture 24:
CMUQ 15381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationECE271B. Nuno Vasconcelos ECE Department, UCSD
ECE271B Statistical ti ti Learning II Nuno Vasconcelos ECE Department, UCSD The course the course is a graduate level course in statistical learning in SLI we covered the foundations of Bayesian or generative
More informationClassification Methods II: Linear and Quadratic Discrimminant Analysis
Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear
More informationTUM 2016 Class 1 Statistical learning theory
TUM 2016 Class 1 Statistical learning theory Lorenzo Rosasco UNIGEMITIIT July 25, 2016 Machine learning applications Texts Images Data: (x 1, y 1 ),..., (x n, y n ) Note: x i s huge dimensional! All
More informationA Magiv CV Theory for LargeMargin Classifiers
A Magiv CV Theory for LargeMargin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationOn the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost Hamed MasnadiShirazi Statistical Visual Computing Laboratory, University of California, San Diego La
More informationMachine Learning 1. Linear Classifiers. Marius Kloft. Humboldt University of Berlin Summer Term Machine Learning 1 Linear Classifiers 1
Machine Learning 1 Linear Classifiers Marius Kloft Humboldt University of Berlin Summer Term 2014 Machine Learning 1 Linear Classifiers 1 Recap Past lectures: Machine Learning 1 Linear Classifiers 2 Recap
More informationChapter 14 Combining Models
Chapter 14 Combining Models T61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients
More informationMachine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015
Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint inputoutput features, Likelihood Maximization, Logistic regression, binary
More informationIntroduction to machine learning and pattern recognition Lecture 2 Coryn BailerJones
Introduction to machine learning and pattern recognition Lecture 2 Coryn BailerJones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 Last week... supervised and unsupervised methods need adaptive
More informationMachine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012
Machine Learning 10601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationVoting (Ensemble Methods)
1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers
More informationAn Introduction to Statistical and Probabilistic Linear Models
An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationOrdinal Classification with Decision Rules
Ordinal Classification with Decision Rules Krzysztof Dembczyński 1, Wojciech Kotłowski 1, and Roman Słowiński 1,2 1 Institute of Computing Science, Poznań University of Technology, 60965 Poznań, Poland
More informationMachine Learning for Signal Processing Bayes Classification
Machine Learning for Signal Processing Bayes Classification Class 16. 24 Oct 2017 Instructor: Bhiksha Raj  Abelino Jimenez 11755/18797 1 Recap: KNN A very effective and simple way of performing classification
More information