Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers

1 Directly and Efficiently Optimizing Prediction Error and AUC of Linear Classifiers Hiva Ghanbari Joint work with Prof. Katya Scheinberg Industrial and Systems Engineering Department US & Mexico Workshop on Optimization and its Applications Huatulco, Mexico January 2018 Hiva Ghanbari (Lehigh University) 1 / 34

2 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 2 / 34

3 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 3 / 34

4 Supervised Learning Problem Given a finite sample data set $S$ of $n$ (input, label) pairs, e.g., $S := \{(x_i, y_i) : i = 1, \ldots, n\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$. We are interested in the binary classification problem in supervised learning: the output is discrete-valued, $+1$ or $-1$. We are interested in a linear classifier (predictor) $f(x; w) = w^T x$, so that $f : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ denotes the space of input values and $\mathcal{Y}$ the space of output values. Hiva Ghanbari (Lehigh University) 4 / 34

5 Supervised Learning Problem (continued) How good is this classifier? Two measures: Prediction Error and Area Under the ROC Curve (AUC). (The setup is the same as on the previous slide.) Hiva Ghanbari (Lehigh University) 4 / 34

6 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 5 / 34

7 Expected Risk (Prediction Error) In $S$, each $(x_i, y_i)$ is an i.i.d. observation of the random variables $(X, Y)$. $(X, Y)$ has an unknown joint probability distribution $P_{X,Y}(x, y)$ over $\mathcal{X}$ and $\mathcal{Y}$. The expected risk associated with a linear classifier $f(x; w) = w^T x$ for the zero-one loss function is defined as $R_{0\text{-}1}(f) = \mathbb{E}_{X,Y}[\ell_{0\text{-}1}(f(X; w), Y)] = \int_{\mathcal{X}} \int_{\mathcal{Y}} P_{X,Y}(x, y)\, \ell_{0\text{-}1}(f(x; w), y)\, dy\, dx$, where $\ell_{0\text{-}1}(f(x; w), y) = 1$ if $y f(x; w) < 0$ and $0$ if $y f(x; w) \geq 0$. Hiva Ghanbari (Lehigh University) 6 / 34

8 Empirical Risk Minimization The joint probability distribution $P_{X,Y}(x, y)$ is unknown, so the empirical risk of the linear classifier $f(x; w)$ for the zero-one loss function over the finite training set $S$ is of interest, i.e., $R_{0\text{-}1}(f; S) = \frac{1}{n} \sum_{i=1}^{n} \ell_{0\text{-}1}(f(x_i; w), y_i)$. Hiva Ghanbari (Lehigh University) 7 / 34

9 Empirical Risk Minimization (continued) Utilizing the logistic regression loss function instead of the 0-1 loss function results in $R_{\log}(f; S) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i f(x_i; w))\right)$. In practice one solves the regularized problem $\min_{w \in \mathbb{R}^d} F_{\log}(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i f(x_i; w))\right) + \lambda \|w\|^2$. Hiva Ghanbari (Lehigh University) 7 / 34
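As a reference point for the experiments later in the talk, here is a minimal sketch (not from the talk; the names X, y, lam are illustrative) of evaluating the regularized logistic objective $F_{\log}(w)$ and its gradient with NumPy.

```python
import numpy as np

def f_log(w, X, y, lam):
    """Regularized logistic objective: mean log(1 + exp(-y_i w^T x_i)) + lam * ||w||^2."""
    margins = y * (X @ w)                        # y_i * w^T x_i for every sample
    return np.mean(np.logaddexp(0.0, -margins)) + lam * np.dot(w, w)

def grad_f_log(w, X, y, lam):
    """Gradient of the objective above with respect to w."""
    margins = y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(margins))        # equals -d/dm log(1 + exp(-m))
    return -(X.T @ (sigma * y)) / X.shape[0] + 2.0 * lam * w
```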

10 Alternative Interpretation of the Prediction Error We can interpret the prediction error as a probability value: $F_{\text{error}}(w) = R_{0\text{-}1}(f) = \mathbb{E}_{X,Y}[\ell_{0\text{-}1}(f(X; w), Y)] = P(Y w^T X < 0)$. Hiva Ghanbari (Lehigh University) 8 / 34

11 Alternative Interpretation of the Prediction Error (continued) If the true values of the prior probabilities $P(Y = +1)$ and $P(Y = -1)$ are known or obtainable from a trivial calculation, then we have Lemma 1: the expected risk can be written in terms of probability values, so that $F_{\text{error}}(w) = P(Y w^T X < 0) = P(Z^+ \leq 0)\, P(Y = +1) + \left(1 - P(Z^- \leq 0)\right) P(Y = -1)$, where $Z^+ = w^T X^+$ and $Z^- = w^T X^-$, for $X^+$ and $X^-$ random variables from the positive and negative classes, respectively. Hiva Ghanbari (Lehigh University) 8 / 34

12 Data with Any Arbitrary Distribution Suppose $(X_1, \ldots, X_n)$ is a multivariate random variable. For a given mapping function $g(\cdot)$ we are interested in the c.d.f. of $Z = g(X_1, \ldots, X_n)$. If we consider the region of the space $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ on which $g(x_1, \ldots, x_n) \leq z$, then we have $F_Z(z) = P(Z \leq z) = P(g(X) \leq z) = \int_{\{x_1 \in \mathcal{X}_1, \ldots, x_n \in \mathcal{X}_n :\, g(x_1, \ldots, x_n) \leq z\}} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n)\, dx_1 \cdots dx_n$. Hiva Ghanbari (Lehigh University) 9 / 34

13 Data with Normal Distribution Assume $X^+ \sim \mathcal{N}(\mu^+, \Sigma^+)$ and $X^- \sim \mathcal{N}(\mu^-, \Sigma^-)$. Hiva Ghanbari (Lehigh University) 10 / 34

14 Data with Normal Distribution Why Normal? The family of multivariate normal distributions is closed under linear transformations. Theorem 2 (Tong (1990)): If $X \sim \mathcal{N}(\mu, \Sigma)$ and $Z = CX + b$, where $C$ is any given $m \times n$ real matrix and $b$ is any $m \times 1$ real vector, then $Z \sim \mathcal{N}(C\mu + b,\, C\Sigma C^T)$. Moreover, the normal distribution has a smooth c.d.f. Hiva Ghanbari (Lehigh University) 10 / 34

15 Prediction Error as a Smooth Function Theorem 3: Suppose that $X^+ \sim \mathcal{N}(\mu^+, \Sigma^+)$ and $X^- \sim \mathcal{N}(\mu^-, \Sigma^-)$. Then $F_{\text{error}}(w) = P(Y = +1)\left(1 - \phi(\mu_{Z^+}/\sigma_{Z^+})\right) + P(Y = -1)\,\phi(\mu_{Z^-}/\sigma_{Z^-})$, where $\mu_{Z^+} = w^T \mu^+$, $\sigma_{Z^+} = \sqrt{w^T \Sigma^+ w}$, $\mu_{Z^-} = w^T \mu^-$, $\sigma_{Z^-} = \sqrt{w^T \Sigma^- w}$, and $\phi$ is the c.d.f. of the standard normal distribution, i.e., $\phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2} t^2\right) dt$ for $x \in \mathbb{R}$. Hiva Ghanbari (Lehigh University) 11 / 34

16 Prediction Error as a Smooth Function (continued) Under the assumptions of Theorem 3, the prediction error is a smooth function of $w$, so we can compute the gradient and apply gradient-based optimization. Hiva Ghanbari (Lehigh University) 11 / 34
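To illustrate how Theorem 3 turns the prediction error into a smooth objective, the snippet below (an illustrative sketch, not the talk's Algorithm 1) evaluates $F_{\text{error}}(w)$ from class priors, means, and covariances, with the moments assumed to be estimated from the training sample.

```python
import numpy as np
from scipy.stats import norm

def f_error(w, mu_pos, cov_pos, mu_neg, cov_neg, p_pos, p_neg):
    """Smooth prediction-error objective of Theorem 3 under the normality assumption."""
    mu_zp, sd_zp = w @ mu_pos, np.sqrt(w @ cov_pos @ w)   # moments of Z+ = w^T X+
    mu_zn, sd_zn = w @ mu_neg, np.sqrt(w @ cov_neg @ w)   # moments of Z- = w^T X-
    return p_pos * (1.0 - norm.cdf(mu_zp / sd_zp)) + p_neg * norm.cdf(mu_zn / sd_zn)

# The moments can be estimated from the training data, e.g.
#   mu_pos, cov_pos = X_pos.mean(axis=0), np.cov(X_pos, rowvar=False)
```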

17 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 12 / 34

18 Learning From Imbalanced Data Sets Many real-world machine learning problems deal with imbalanced training data. [Figure: (a) Balanced data set, (b) Imbalanced data set] Hiva Ghanbari (Lehigh University) 13 / 34

19 Receiver Operating Characteristic (ROC) Curve Sort the outputs by descending value of $f(x; w) = w^T x$. Confusion matrix: Actual Positive is a True Positive (TP) if predicted positive and a False Negative (FN) if predicted negative; Actual Negative is a False Positive (FP) if predicted positive and a True Negative (TN) if predicted negative. Various thresholds result in different True Positive Rate $= \frac{TP}{TP + FN}$ and False Positive Rate $= \frac{FP}{FP + TN}$. The ROC curve presents the tradeoff between the TPR and the FPR over all possible thresholds. Hiva Ghanbari (Lehigh University) 14 / 34
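For example, the (TPR, FPR) point for a single threshold can be computed as in this small sketch (illustrative names; labels are assumed to be in {+1, -1} and scores = X @ w).

```python
import numpy as np

def tpr_fpr(scores, y, threshold):
    """TPR and FPR when predicting +1 whenever the score exceeds the threshold."""
    pred_pos = scores > threshold
    tp = np.sum(pred_pos & (y == +1))
    fn = np.sum(~pred_pos & (y == +1))
    fp = np.sum(pred_pos & (y == -1))
    tn = np.sum(~pred_pos & (y == -1))
    return tp / (tp + fn), fp / (fp + tn)

# Sweeping the threshold over all observed scores traces out the ROC curve.
```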

20 Area Under ROC Curve (AUC) How can we compare ROC curves? Hiva Ghanbari (Lehigh University) 15 / 34

21 Area Under ROC Curve (AUC) How can we compare ROC curves? Higher AUC = better classifier. Hiva Ghanbari (Lehigh University) 15 / 34

22 An Unbiased Estimate of the AUC Value An unbiased estimate of the AUC value of a linear classifier can be obtained via the Wilcoxon-Mann-Whitney (WMW) statistic (Mann and Whitney (1947)): $\text{AUC}(f; S^+, S^-) = \frac{\sum_{i=1}^{n^+} \sum_{j=1}^{n^-} \mathbb{1}\left[f(x_i^+; w) > f(x_j^-; w)\right]}{n^+ n^-}$, where $\mathbb{1}\left[f(x_i^+; w) > f(x_j^-; w)\right] = 1$ if $f(x_i^+; w) > f(x_j^-; w)$ and $0$ otherwise, and $S = S^+ \cup S^-$. Hiva Ghanbari (Lehigh University) 16 / 34
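A direct implementation of the WMW estimate costs $O(n^+ \cdot n^-)$ comparisons; a minimal NumPy sketch (illustrative names) is shown below.

```python
import numpy as np

def auc_wmw(w, X_pos, X_neg):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly by w^T x."""
    s_pos = X_pos @ w
    s_neg = X_neg @ w
    # Compare every positive score with every negative score.
    return np.mean(s_pos[:, None] > s_neg[None, :])
```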

23 AUC Approximation via Surrogate Losses The indicator function $\mathbb{1}[\cdot]$ can be approximated with: a sigmoid surrogate function (Yan et al. (2003)); a pairwise exponential loss or pairwise logistic loss (Rudin and Schapire (2009)); or a pairwise hinge loss (Steck (2007)), $F_{\text{hinge}}(w) = \frac{\sum_{i=1}^{n^+} \sum_{j=1}^{n^-} \max\left\{0,\, 1 - \left(f(x_i^+; w) - f(x_j^-; w)\right)\right\}}{n^+ n^-}$. Hiva Ghanbari (Lehigh University) 17 / 34
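The pairwise hinge surrogate above can be evaluated in the same pairwise fashion; a sketch with illustrative names:

```python
import numpy as np

def f_hinge(w, X_pos, X_neg):
    """Pairwise hinge surrogate: average over pairs of max(0, 1 - (w^T x+ - w^T x-))."""
    diffs = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]   # f(x+_i; w) - f(x-_j; w)
    return np.mean(np.maximum(0.0, 1.0 - diffs))
```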

24 Measuring AUC Statistically Let $\mathcal{X}^+$ and $\mathcal{X}^-$ denote the spaces of the positive and negative input values. Then $x_i^+$ is an i.i.d. observation of the random variable $X^+$ and $x_j^-$ is an i.i.d. observation of the random variable $X^-$. If the joint probability distribution $P_{X^+, X^-}(x^+, x^-)$ is known, the actual AUC value associated with a linear classifier $f(x; w) = w^T x$ is defined as $\text{AUC}(f) = \mathbb{E}_{X^+, X^-}\left[\mathbb{1}\left[f(X^+; w) > f(X^-; w)\right]\right] = \int_{\mathcal{X}^+} \int_{\mathcal{X}^-} P_{X^+, X^-}(x^+, x^-)\, \mathbb{1}\left[f(x^+; w) > f(x^-; w)\right] dx^-\, dx^+$. Hiva Ghanbari (Lehigh University) 18 / 34

25 Alternative Interpretation of AUC Lemma 4: We can interpret the AUC value as a probability value: $F_{\text{AUC}}(w) = 1 - \text{AUC}(f) = 1 - \mathbb{E}_{X^+, X^-}\left[\mathbb{1}\left[f(X^+; w) > f(X^-; w)\right]\right] = 1 - P(Z < 0)$, where $Z = w^T(X^- - X^+)$, for $X^+$ and $X^-$ random variables from the positive and negative classes, respectively. Hiva Ghanbari (Lehigh University) 19 / 34

26 AUC as a Smooth Function Theorem 5: If the two random variables $X^+$ and $X^-$ have a joint multivariate normal distribution, such that $\begin{pmatrix} X^+ \\ X^- \end{pmatrix} \sim \mathcal{N}(\mu, \Sigma)$, where $\mu = \begin{pmatrix} \mu^+ \\ \mu^- \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \Sigma^{++} & \Sigma^{+-} \\ \Sigma^{-+} & \Sigma^{--} \end{pmatrix}$, then the AUC function can be written as $F_{\text{AUC}}(w) = 1 - \phi\!\left(\frac{-\mu_Z}{\sigma_Z}\right)$, where $\mu_Z = w^T\left(\mu^- - \mu^+\right)$, $\sigma_Z = \sqrt{w^T\left(\Sigma^{--} + \Sigma^{++} - \Sigma^{-+} - \Sigma^{+-}\right) w}$, and $\phi$ is the c.d.f. of the standard normal distribution. Hiva Ghanbari (Lehigh University) 20 / 34

27 AUC as a Smooth Function (continued) Under the assumptions of Theorem 5, the AUC is a smooth function of $w$, so we can compute the gradient. Hiva Ghanbari (Lehigh University) 20 / 34
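In code, Theorem 5 reduces the $1 - \text{AUC}$ objective to a single normal c.d.f. evaluation. The sketch below follows the sign convention used above ($Z = w^T(X^- - X^+)$) and assumes the block means and covariances are estimated from data; cov_pn denotes the cross-covariance of $X^+$ and $X^-$ (zero if the two classes are sampled independently).

```python
import numpy as np
from scipy.stats import norm

def f_auc(w, mu_pos, mu_neg, cov_pp, cov_nn, cov_pn):
    """Smooth 1 - AUC objective of Theorem 5 under the joint-normality assumption."""
    mu_z = w @ (mu_neg - mu_pos)                           # mean of Z = w^T (X- - X+)
    sigma_z = np.sqrt(w @ (cov_nn + cov_pp - cov_pn - cov_pn.T) @ w)
    return 1.0 - norm.cdf(-mu_z / sigma_z)
```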

28 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 21 / 34

29 Computational Settings We used gradient descent with a backtracking line search as the optimization method. The algorithm is implemented in Python, and computations were performed on a computational cluster. We used both artificial data sets and real data sets, with five-fold cross-validation in a train-test framework. Hiva Ghanbari (Lehigh University) 22 / 34
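A generic sketch of the optimization loop described on this slide: gradient descent with an Armijo backtracking line search (illustrative, not necessarily identical to the talk's Algorithm 1).

```python
import numpy as np

def gradient_descent(f, grad, w0, max_iter=500, tol=1e-6, beta=0.5, c=1e-4):
    """Minimize f with gradient descent, choosing each step size by backtracking."""
    w = w0.copy()
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        step, fw = 1.0, f(w)
        # Shrink the step until the Armijo sufficient-decrease condition holds.
        while f(w - step * g) > fw - c * step * np.dot(g, g):
            step *= beta
        w = w - step * g
    return w

# Example use with the sketches above:
#   w_star = gradient_descent(lambda w: f_log(w, X, y, lam),
#                             lambda w: grad_f_log(w, X, y, lam), np.zeros(d))
```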

30 Directly Optimizing Prediction Error and AUC: Artificial Data Sets Information Table 1: Artificial data sets statistics. d: number of features, n: number of data points, P+, P-: prior probabilities, out%: percentage of outlier data points. The nine artificial data sets are generated randomly from normal distributions. [Table values not preserved in this transcription.] In these data sets, we have swapped a specific percentage of the positive training data points with negative ones and vice versa. The numerical results summarized in Table 2 are the results of 20 randomly generated data sets. At each run we used 80 percent of the data points as the training data and the rest of the data points as the test data. The reported average accuracy and standard deviation are based on the test data not seen during the training process. The results presented in Table 2 are based on minimizing $F_{\text{error}}(w)$ versus $F_{\log}(w)$. Hiva Ghanbari (Lehigh University) 23 / 34
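A sketch of the kind of data-generation protocol described above: two Gaussian classes, a chosen prior, and a fraction of swapped labels acting as outliers (the parameters are illustrative; the actual sizes and fractions of Table 1 are not reproduced).

```python
import numpy as np

def make_gaussian_data(n, d, p_pos, flip_frac, seed=0):
    """Two Gaussian classes in R^d with a fraction of labels swapped as outliers."""
    rng = np.random.default_rng(seed)
    n_pos = int(round(p_pos * n))
    X = np.vstack([rng.normal(+1.0, 1.0, size=(n_pos, d)),
                   rng.normal(-1.0, 1.0, size=(n - n_pos, d))])
    y = np.concatenate([np.ones(n_pos), -np.ones(n - n_pos)])
    swap = rng.choice(n, size=int(flip_frac * n), replace=False)
    y[swap] = -y[swap]                       # swap a percentage of the labels
    return X, y
```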

31 Optimizing $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ on Artificial Data Table 2: $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ minimization via Algorithm 1 on artificial data sets. Columns: $F_{\text{error}}(w)$ minimization with exact moments, $F_{\text{error}}(w)$ minimization with approximate moments, and $F_{\log}(w)$ minimization, each reporting Accuracy ± std and Time (s) for the nine artificial data sets. [Numeric values not preserved in this transcription.] Hiva Ghanbari (Lehigh University) 24 / 34

32 Real Data Sets Information These data sets can be downloaded from the LIBSVM website (cjlin/libsvmtools/datasets/binary.html) and the UCI machine learning repository. Table 3: Real data sets statistics. d: number of features, n: number of data points, P+, P-: prior probabilities, AC: attribute characteristics. Data sets (with attribute characteristics): fourclass ([-1, 1], real), svmguide1 ([-1, 1], real), diabetes ([-1, 1], real), shuttle ([-1, 1], real), vowel ([-6, 6], int), magic04 ([-1, 1], real), poker ([1, 13], int), letter ([0, 15], int), segment ([-1, 1], real), svmguide3 ([-1, 1], real), ijcnn1 ([-1, 1], real), german ([-1, 1], real), landsat satellite ([27, 157], int), sonar ([-1, 1], real), a9a (binary), w8a (binary), mnist ([0, 1], real), colon-cancer ([-1, 1], real), gisette ([-1, 1], real), rcv1 ([-1, 1], real). [Values of d, n, P+, P- not preserved in this transcription.] Table 4 summarizes the computational results comparing the performance of the linear classifiers obtained through minimizing $F_{\text{error}}(w)$ versus $F_{\log}(w)$ via Algorithm 1. Hiva Ghanbari (Lehigh University) 25 / 34

33 Optimizing $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ on Real Data Table 4: $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ minimization via Algorithm 1 on real data sets. Columns: $F_{\text{error}}(w)$ minimization and $F_{\log}(w)$ minimization, each reporting Accuracy ± std and Time (s) for the 20 real data sets of Table 3. [Numeric values not preserved in this transcription.] Hiva Ghanbari (Lehigh University) 26 / 34

34 Optimizing $F_{\text{error}}(w)$ vs. $F_{\log}(w)$ on Real Data [Figure; panels: Minimizing $F_{\log}$, Minimizing $F_{\text{error}}$] Hiva Ghanbari (Lehigh University) 27 / 34

35 Optimizing $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ on Artificial Data Table 5: $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ minimization via Algorithm 1 on artificial data sets. Columns: $F_{\text{AUC}}(w)$ minimization with exact moments, $F_{\text{AUC}}(w)$ minimization with approximate moments, and $F_{\text{hinge}}(w)$ minimization, each reporting AUC ± std and Time (s) for the nine artificial data sets. [Numeric values not preserved in this transcription.] Hiva Ghanbari (Lehigh University) 28 / 34

36 Optimizing $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ on Real Data Table 6: $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ minimization via Algorithm 1 on real data sets. Columns: $F_{\text{AUC}}(w)$ minimization and $F_{\text{hinge}}(w)$ minimization, each reporting AUC ± std and Time (s) for the 20 real data sets of Table 3. [Numeric values not preserved in this transcription.] Table 6 summarizes the results on real data sets, in a similar manner to Table 4, with the difference that the AUC values of the linear classifiers obtained through minimizing $F_{\text{AUC}}(w)$ versus $F_{\text{hinge}}(w)$ are of interest. The average AUC values obtained through minimizing $F_{\text{AUC}}(w)$ and $F_{\text{hinge}}(w)$ are comparable on 14 of the 20 data sets, although in terms of solution time, minimizing $F_{\text{AUC}}(w)$ consistently outperforms minimizing $F_{\text{hinge}}(w)$. Moreover, on 4 data sets, minimizing $F_{\text{AUC}}(w)$ performs better than minimizing $F_{\text{hinge}}(w)$. Hiva Ghanbari (Lehigh University) 29 / 34

37 Optimizing $F_{\text{AUC}}(w)$ vs. $F_{\text{hinge}}(w)$ on Real Data [Figure; panels: Minimizing $F_{\text{hinge}}$, Minimizing $F_{\text{AUC}}$] Hiva Ghanbari (Lehigh University) 30 / 34

38 Outline Introduction Directly Optimizing Prediction Error Directly Optimizing AUC Numerical Analysis Summary Hiva Ghanbari (Lehigh University) 31 / 34

39 Summary We proposed conditions under which the expected prediction error and the AUC are smooth functions, so that any gradient-based optimization method can be applied to optimize these functions directly. The newly proposed approaches work efficiently without perturbing the unknown distribution of the real data sets. Studying data distributions may lead to new efficient approaches. Hiva Ghanbari (Lehigh University) 32 / 34

40 References H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18:50-60, 1947. Y. L. Tong. The Multivariate Normal Distribution. Springer Series in Statistics, 1990. G. Casella and R. L. Berger. Statistical Inference. Pacific Grove, CA: Duxbury, 2nd edition, 2002. Hiva Ghanbari (Lehigh University) 33 / 34

41 Thanks for your attention!
