Supplementary Materials for


advances.sciencemag.org/cgi/content/full/4/2/eaao1659/DC1

Neyman-Pearson classification algorithms and NP receiver operating characteristics

Xin Tong, Yang Feng, Jingyi Jessica Li

Published 2 February 2018, Sci. Adv. 4, eaao1659 (2018)
DOI: 10.1126/sciadv.aao1659

The PDF file includes:

Proof of Proposition 1
Conditional type II error bounds in NP-ROC bands
Empirical ROC curves versus NP-ROC bands in guiding users to choose classifiers to satisfy type I error control
Effects of majority voting on the type I and II errors of the ensemble classifier
table S1. Results of LR in Simulation S2.
table S2. Results of SVMs in Simulation S2.
table S3. Results of RFs in Simulation S2.
table S4. Results of NB in Simulation S2.
table S5. Results of LDA in Simulation S2.
table S6. Results of AdaBoost in Simulation S2.
table S7. Description of variables used in real data application 1.
table S8. The performance of the NP umbrella algorithm in real data application 2.
table S9. Input information of the nproc package (version 2.0.9).

Other Supplementary Material for this manuscript includes the following:
(available at advances.sciencemag.org/cgi/content/full/4/2/eaao1659/DC1)

R codes and data sets

Supplementary Materials

Proof of Proposition 1

Proof: In our context, let T_(k) be the k-th ordered classification score of a left-out class 0 sample (i.e., a class 0 sample not used to train the base algorithm). Suppose T_(k) is the chosen threshold on classification scores. Then the classifier is ϕ_k = I(T > T_(k)), where T is the classification score of an independent observation from class 0, and the population type I error of ϕ_k given T_(k) is P[T > T_(k) | T_(k)] = 1 − F(T_(k)), where F is the cumulative distribution function of the T_i's. Hence, the violation rate (the probability that the population type I error of ϕ_k exceeds α) is

P[1 − F(T_(k)) > α] = P[F(T_(k)) < 1 − α]
= P[T_(k) < F^(−1)(1 − α)]
= P[T_(1) < F^(−1)(1 − α), ..., T_(k) < F^(−1)(1 − α)]
= P[at least k of the T_i's are less than F^(−1)(1 − α)]
= Σ_{j=k}^{n} P[exactly j of the T_i's are less than F^(−1)(1 − α)]
= Σ_{j=k}^{n} (n choose j) P[T_i < F^(−1)(1 − α)]^j (1 − P[T_i < F^(−1)(1 − α)])^(n−j)
≤ Σ_{j=k}^{n} (n choose j) (1 − α)^j α^(n−j),    (S1)

where the last inequality holds because P[T_i < F^(−1)(1 − α)] ≤ 1 − α and the right-hand side, viewed as P[Binomial(n, p) ≥ k] with success probability p, is increasing in p; the inequality becomes an equality when F is continuous.

Remark: The proof does not make any assumptions on F. Regardless of the continuity of F, we define its inverse as F^(−1)(x) = inf{y : F(y) ≥ x}, which has the property: x ≤ F(y) if and only if F^(−1)(x) ≤ y, for any x ∈ [0, 1] and y in the domain of F.
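The bound in (S1) is a binomial tail probability and is straightforward to evaluate numerically. In the spirit of the NP umbrella algorithm, one can choose the smallest order k whose bound does not exceed the tolerance δ. The following minimal R sketch is illustrative only (it is not the R code released with the paper, and the helper name np_order is hypothetical); it assumes n left-out class 0 scores are available:

```r
# Evaluate the violation-rate bound in (S1) for every order k and pick the
# smallest k whose bound is at most delta.
np_order <- function(n, alpha = 0.05, delta = 0.05) {
  # bound(k) = sum_{j=k}^{n} choose(n, j) (1 - alpha)^j alpha^(n - j)
  #          = P[Binomial(n, 1 - alpha) >= k]
  bound <- sapply(1:n, function(k) pbinom(k - 1, n, 1 - alpha, lower.tail = FALSE))
  k_star <- which(bound <= delta)[1]   # NA if n is too small for any valid order
  list(k = k_star,
       violation_bound = if (is.na(k_star)) NA else bound[k_star])
}

np_order(100)  # e.g., n = 100 left-out class 0 scores, alpha = delta = 0.05
```

Because the bound decreases in k, a valid order exists only when the bound at k = n, namely (1 − α)^n, is at most δ; otherwise the sketch returns NA, indicating that the left-out class 0 sample is too small.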

Conditional type II error bounds in NP-ROC bands

Given training data S = S^0 ∪ S^1, where S^0 and S^1 are the class 0 and class 1 samples, respectively, we randomly split S^0 into S^0_1 and S^0_2 and split S^1 into S^1_1 and S^1_2. For simplicity, we let |S^0_1| = |S^0_2| = n and |S^1_1| = |S^1_2| = m, and express the two left-out samples as S^0_2 = {x_1, ..., x_n} and S^1_2 = {X_1, ..., X_m}. In the following discussion, we treat S^0_1, S^0_2, and S^1_1 as fixed (by conditioning on them) and only consider the m data points in S^1_2 as random variables. We train a base classification algorithm (e.g., SVM) on S^0_1 ∪ S^1_1 and denote the resulting classification scoring function by f. Because f is a function of S^0_1 ∪ S^1_1, in our discussion here f is a fixed function that maps the feature space to ℝ. After applying f to the left-out samples S^0_2 and S^1_2, we denote the resulting classification scores by t_i = f(x_i), i = 1, ..., n, and T_j = f(X_j), j = 1, ..., m, respectively.

Suppose that we decide to use the k-th ordered left-out class 0 score, t_(k), as the score threshold. We then construct an NP classifier ϕ_k (based on one (M = 1) random split) as

ϕ_k(X) = I(f(X) > t_(k)).

We then find the corresponding rank of t_(k) among the left-out class 1 scores T_1, ..., T_m. There are three scenarios.

Scenario 1. If T_(1) ≤ t_(k) ≤ T_(m), we define the lower bound rank r_L and the upper bound rank r_U as

r_L = max{r ∈ {1, ..., m} : T_(r) ≤ t_(k)},    (S2)
r_U = min{r ∈ {1, ..., m} : T_(r) ≥ t_(k)},    (S3)

and denote their corresponding classifiers as

ϕ_{r_L}(X) = I(f(X) > T_(r_L)),    (S4)
ϕ_{r_U}(X) = I(f(X) > T_(r_U)).    (S5)

We define the conditional (here the conditioning is on the training data S^0_1, S^0_2, and S^1_1) type II errors of ϕ_{r_L} and ϕ_{r_U} as

R_1^c(ϕ_{r_L}) ≡ P[f(X^1) ≤ T_(r_L) | T_(r_L)] = F_1(T_(r_L)),    (S6)
R_1^c(ϕ_{r_U}) ≡ P[f(X^1) ≤ T_(r_U) | T_(r_U)] = F_1(T_(r_U)),    (S7)

where X^1 represents a new class 1 observation, and F_1 is the cumulative distribution function of the classification scores of class 1 observations. Because T_(r_L) ≤ t_(k) ≤ T_(r_U), we have

R_1^c(ϕ_{r_L}) ≤ R_1^c(ϕ_k) ≤ R_1^c(ϕ_{r_U}),    (S8)

where R_1^c(·) stands for the conditional type II error, and R_1^c(ϕ_k) = P[f(X^1) ≤ t_(k)] = F_1(t_(k)). In (S8), the left inequality becomes tight when T_(r_L) = t_(k), and the right inequality becomes tight when T_(r_U) = t_(k).

Given a pre-specified tolerance level δ, denote by β_L(ϕ_k) and β_U(ϕ_k) the (1 − δ) high-probability lower and upper bounds of R_1^c(ϕ_k). Specifically, β_L(ϕ_k) and β_U(ϕ_k) are defined as

β_L(ϕ_k) ≡ sup{β ∈ [0, 1] : Σ_{j=r_L}^{m} (m choose j) β^j (1 − β)^(m−j) ≤ δ},    (S9)
β_U(ϕ_k) ≡ inf{β ∈ [0, 1] : Σ_{j=r_U}^{m} (m choose j) β^j (1 − β)^(m−j) ≥ 1 − δ}.    (S10)

The reason that (S9) and (S10) give valid (1 − δ) high-probability lower and upper bounds is as follows. A constant β serves as a valid (1 − δ) high-probability lower bound of R_1^c(ϕ_k) if

P[R_1^c(ϕ_k) < β] ≤ δ.    (S11)

Since by (S8), P[R_1^c(ϕ_k) < β] ≤ P[R_1^c(ϕ_{r_L}) < β], in order to have (S11) hold it is sufficient to have P[R_1^c(ϕ_{r_L}) < β] ≤ δ. By (S6),

P[R_1^c(ϕ_{r_L}) < β] = P[F_1(T_(r_L)) < β] = P[T_(r_L) < F_1^(−1)(β)] ≤ Σ_{j=r_L}^{m} (m choose j) β^j (1 − β)^(m−j),

where the last inequality follows from (S1) in the proof of Proposition 1. Hence, it suffices to have

Σ_{j=r_L}^{m} (m choose j) β^j (1 − β)^(m−j) ≤ δ

to make (S11) hold. Among all the β values that satisfy this sufficient condition, we choose the supremum as the (1 − δ) high-probability lower bound of R_1^c(ϕ_k), leading to (S9).

A constant β serves as a valid (1 − δ) high-probability upper bound of R_1^c(ϕ_k) if

P[R_1^c(ϕ_k) ≤ β] ≥ 1 − δ.    (S12)

Since by (S8), P[R_1^c(ϕ_k) ≤ β] ≥ P[R_1^c(ϕ_{r_U}) ≤ β], in order to have (S12) hold it is sufficient to have P[R_1^c(ϕ_{r_U}) ≤ β] ≥ 1 − δ.

By (S7),

P[R_1^c(ϕ_{r_U}) ≤ β] = P[F_1(T_(r_U)) ≤ β] = P[F_1(T_(r_U)) < β] = Σ_{j=r_U}^{m} (m choose j) β^j (1 − β)^(m−j),

where the last two equalities follow from (S1) in the proof of Proposition 1, under the assumption that F_1 is continuous, which holds for most classification algorithms. Hence, it suffices to have

Σ_{j=r_U}^{m} (m choose j) β^j (1 − β)^(m−j) ≥ 1 − δ

to make (S12) hold. Among all the β values that satisfy this sufficient condition, we choose the infimum as the (1 − δ) high-probability upper bound of R_1^c(ϕ_k), leading to (S10).

Scenario 2. If t_(k) > T_(m), we define the lower bound rank r_L the same as in (S2) and the (1 − δ) high-probability lower bound β_L(ϕ_k) the same as in (S9), and we set the (1 − δ) high-probability upper bound β_U(ϕ_k) = 1.

Scenario 3. If t_(k) < T_(1), we define the upper bound rank r_U the same as in (S3) and the (1 − δ) high-probability upper bound β_U(ϕ_k) the same as in (S10), and we set the (1 − δ) high-probability lower bound β_L(ϕ_k) = 0.

In all three scenarios, we have P[R_1^c(ϕ_k) < β_L(ϕ_k)] ≤ δ and P[R_1^c(ϕ_k) > β_U(ϕ_k)] ≤ δ, leading to

P[β_L(ϕ_k) ≤ R_1^c(ϕ_k) ≤ β_U(ϕ_k)] ≥ 1 − 2δ.
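For a given split, β_L and β_U in (S9) and (S10) can be computed by inverting binomial tail probabilities, because Σ_{j=r}^{m} (m choose j) β^j (1 − β)^(m−j) = P[Binomial(m, β) ≥ r] is continuous and increasing in β. The R sketch below is an illustration under these definitions, not the nproc implementation (the function name type2_bounds is hypothetical); it also covers Scenarios 2 and 3 by defaulting to β_L = 0 and β_U = 1 when the corresponding rank does not exist:

```r
# t:  left-out class 0 scores from one split; T1: left-out class 1 scores
# k:  chosen order for the threshold t_(k); delta: tolerance level
type2_bounds <- function(t, T1, k, delta = 0.05) {
  thr <- sort(t)[k]            # t_(k), the chosen score threshold
  T1  <- sort(T1)              # ordered left-out class 1 scores
  m   <- length(T1)
  # G(beta, r) = P[Binomial(m, beta) >= r], increasing and continuous in beta
  G <- function(beta, r) pbinom(r - 1, m, beta, lower.tail = FALSE)

  beta_L <- 0                  # Scenario 3 default
  beta_U <- 1                  # Scenario 2 default
  if (any(T1 <= thr)) {        # r_L exists (Scenarios 1 and 2), see (S2)
    r_L <- max(which(T1 <= thr))
    beta_L <- uniroot(function(b) G(b, r_L) - delta, c(0, 1))$root        # (S9)
  }
  if (any(T1 >= thr)) {        # r_U exists (Scenarios 1 and 3), see (S3)
    r_U <- min(which(T1 >= thr))
    beta_U <- uniroot(function(b) G(b, r_U) - (1 - delta), c(0, 1))$root  # (S10)
  }
  c(beta_L = beta_L, beta_U = beta_U)
}
```

With t and T1 taken from one random split and k chosen as in Proposition 1, the returned pair brackets the conditional type II error of ϕ_k with probability at least 1 − 2δ.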

Empirical ROC curves versus NP-ROC bands in guiding users to choose classifiers to satisfy type I error control

In practice, the popular ROC curves cannot provide proper guidance for comparing two classifiers whose type I errors are bounded from above by some α, because ROC curves are constructed based on empirical type I and type II errors, which are calculated from test data or by cross-validation on training data and do not display population type I error information. We refer to such ROC curves as empirical ROC curves in the main text to differentiate them from the oracle ROC curves. Concretely, given an empirical ROC curve of a classification method, it is unclear how users should decide which point on the curve corresponds to a classifier satisfying the type I error bound α. Through Simulation 1 and Fig. 2, we showed that users cannot simply pick the point whose empirical type I error (i.e., the horizontal coordinate on the ROC curve) is no greater than and closest to α, a seemingly intuitive but actually improper practice. We further illustrate this point in Simulation S1 and Fig. 3. Due to the lack of direct information on population type I errors, existing methods for constructing ROC confidence bands cannot serve this purpose either.

Simulation S1. We use the same setup as in Simulation 1: (X | Y = 0) ~ N(0, 1) and (X | Y = 1) ~ N(2, 1), with P(Y = 0) = P(Y = 1) = 0.5. We simulate 2D = 2000 data sets {(x_i^(m), y_i^(m))}_{i=1}^N, where m = 1, ..., 2D and N = 1000. The first D data sets are used as the training data, and the other D data sets are the test data. On the m-th training data set, we construct N classifiers, I(X > x_i^(m)), i = 1, ..., N, and evaluate their empirical type I and II errors on the m-th test data set (i.e., the (D + m)-th simulated data set), resulting in one ROC curve. We also use the m-th training data set to calculate one NP-ROC lower curve, i.e., the lower curve of an NP-ROC band. Fig. 3 illustrates the D = 1000 ROC curves and NP-ROC lower curves.

Suppose that users would like to find a classifier respecting a type I error bound α = 0.05 with tolerance level δ = 0.05 from an ROC curve. An intuitive choice is to pick the classifier at the intersection of the ROC curve and the vertical line at α. If there is no classifier right at the intersection, a reasonable idea is to pick the first classifier to the left of the intersection. For the D classifiers chosen in this way, we summarize their empirical type I errors (on the test data) and their population type I errors as histograms (Fig. 3, left panel). The results suggest that although none of the classifiers has an empirical type I error greater than α, approximately 30% of the classifiers have population type I errors greater than α, violating the users' desire to control the type I error under α with at least 0.95 probability. On the other hand, the NP-ROC lower curves provide a natural way for users to choose classifiers given α, because the horizontal coordinates of the NP-ROC curves are type I error upper bounds: users can simply pick the classifier with horizontal coordinate α. For the D chosen NP classifiers, we summarize their empirical type I errors on the test data and their population type I errors as histograms (Fig. 3, right panel). It is clear that the violation rate of the population type I error is under δ = 0.05.

Another use of the NP-ROC lower curve is that it provides a conservative point-wise estimate of the oracle ROC curve. For an NP classifier ϕ, the corresponding point on an NP-ROC lower curve is, with high probability, below and to the right of the corresponding point on the oracle ROC curve. To explain this phenomenon, suppose that the classifier corresponds to the point (α(ϕ), 1 − β_U(ϕ)) on an NP-ROC lower curve and to the point (R_0(ϕ), 1 − R_1(ϕ)) on the oracle ROC curve. Then, by the definition of α(ϕ) (Equation (3)) and β_U(ϕ) (Equation (S10)) as the high-probability upper bounds on R_0(ϕ) and R_1^c(ϕ) (the conditional type II error, conditioning on the training data), respectively, we know that α(ϕ) ≥ R_0(ϕ) and β_U(ϕ) ≥ R_1^c(ϕ) hold with high probability. As R_1(ϕ) = E[R_1^c(ϕ)], where the expectation is with respect to the joint distribution of the training data, we have β_U(ϕ) ≥ R_1(ϕ) with high probability. Hence, the point on the NP-ROC lower curve is below and to the right of the corresponding point on the oracle ROC curve with high probability. In other words, for an NP classifier, the coordinates on the NP-ROC lower curve provide a conservative estimate of the corresponding coordinates on the oracle ROC curve. This phenomenon is visualized in the top right panel of Fig. 3.
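As a concrete illustration of the selection rule examined in Simulation S1, the following R sketch (illustrative only; it is not the R code accompanying the paper, the helper name sim_one is hypothetical, and the number of repetitions is reduced here) picks, on each simulated ROC curve, the threshold whose empirical type I error is no greater than and closest to α, and then evaluates its population type I error exactly with pnorm:

```r
set.seed(1)
alpha <- 0.05
N <- 1000

sim_one <- function() {
  # class 0 ~ N(0, 1), class 1 ~ N(2, 1), P(Y = 0) = P(Y = 1) = 0.5
  y_tr <- rbinom(N, 1, 0.5); x_tr <- rnorm(N, mean = 2 * y_tr)
  y_te <- rbinom(N, 1, 0.5); x_te <- rnorm(N, mean = 2 * y_te)
  thresholds <- sort(x_tr)                           # classifiers I(X > threshold)
  emp_t1 <- sapply(thresholds, function(thr) mean(x_te[y_te == 0] > thr))
  ok <- which(emp_t1 <= alpha)                       # empirical type I error <= alpha
  thr_star <- thresholds[ok[which.max(emp_t1[ok])]]  # closest to alpha from below
  1 - pnorm(thr_star)                                # population type I error P[N(0,1) > thr_star]
}

pop_t1 <- replicate(200, sim_one())
mean(pop_t1 > alpha)  # fraction of chosen classifiers whose population type I error exceeds alpha
```

The final fraction corresponds to the violation rate discussed above, which this selection rule does not keep under δ.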

Effects of majority voting on the type I and II errors of the ensemble classifier

In the following simulation, we demonstrate that the ensemble classifiers built from multiple random splits maintain the type I error violation rates under δ and achieve reduced average type II errors with smaller standard errors over multiple simulations, as compared with the NP classifiers built from just one random split.

Simulation S2. Our first generative model is a logistic regression (LR) model with d = 3 independent features and n = 1000 observations. For each feature, values are independently drawn from a standard Normal distribution. The three features have non-zero coefficients 3, 2.4, and 0.8, respectively. Denoting the feature matrix by X = [x_1, ..., x_n]^T ∈ ℝ^(n×d) and the coefficient vector by β ∈ ℝ^d, we simulate the response vector as Y = (Y_1, ..., Y_n)^T, with Y_i independently drawn from Bernoulli(1 / (1 + exp(−x_i^T β))).

Our second generative model is a linear discriminant analysis (LDA) model with d = 3 independent features and n = 1000 observations. We first simulate the response vector as Y = (Y_1, ..., Y_n)^T, with Y_i independently drawn from Bernoulli(0.5). Then we simulate the feature matrix X = [x_1, ..., x_n]^T ∈ ℝ^(n×d) as (x_i | Y_i = 0) ~ N((0, 0, 0)^T, I_3) and (x_i | Y_i = 1) ~ N((2, 0.6, 0.2)^T, I_3).

We simulate 1000 datasets from each of the above two models, obtaining 2000 datasets. With α = 0.05, δ = 0.05, and a varying number of splits M ∈ {1, 5, 9, 11, 15}, we construct five NP classifiers (one for each value of M) for each of six methods (logistic regression, support vector machines, random forests, naïve Bayes, linear discriminant analysis, and AdaBoost) based on each data set. We separately simulate two large test data sets with 10^6 observations from the two models, so that we can use the test data sets to calculate the test (empirical) type I and type II errors, which approximate the population type I and type II errors. For each combination of M value, classification method, and generative model, we use the corresponding NP classifiers to calculate the average of the approximate type I errors and its standard error, the percentage of classifiers with approximate type I errors greater than α (the type I error violation rate), and the average of the approximate type II errors and its standard error. The results, summarized in tables S1 to S6, show that the type I error violation rates of the ensemble classifiers stay below δ = 5%, and that the average type I and type II errors, which approximate the averages of the population type I and type II errors of the ensemble classifiers, as well as their standard deviations, decrease from M = 1 to M > 1. This simulation experiment and its results demonstrate the validity and effectiveness of the ensemble approach by majority voting.
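To make the two generative models and the ensemble step concrete, here is a minimal R sketch (illustrative only; it is not the paper's simulation code, the function names are hypothetical, and the parameter values follow the reconstruction in the text above):

```r
# Logistic regression generative model: independent N(0, 1) features, logistic link
simulate_lr <- function(n = 1000, beta = c(3, 2.4, 0.8)) {
  X <- matrix(rnorm(n * length(beta)), nrow = n)
  p <- 1 / (1 + exp(-X %*% beta))
  list(x = X, y = rbinom(n, 1, p))
}

# LDA generative model: Y ~ Bernoulli(0.5), class 0 mean 0, class 1 mean mu1, identity covariance
simulate_lda <- function(n = 1000, mu1 = c(2, 0.6, 0.2)) {
  y <- rbinom(n, 1, 0.5)
  X <- matrix(rnorm(n * length(mu1)), nrow = n) + outer(y, mu1)
  list(x = X, y = y)
}

# Majority vote across M individual NP classifiers: 'votes' is an
# n_test x M matrix of 0/1 predictions; ties cannot occur when M is odd.
majority_vote <- function(votes) as.integer(rowMeans(votes) > 0.5)
```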

table S1. Results of LR in Simulation S2. The first column indicates the data generative model (LR: logistic regression; LDA: linear discriminant analysis). The second column indicates the number of random splits on the class 0 training data to construct each ensemble classifier. The third column indicates the (approximate) averages (avg) and standard deviations (sd) of the population type I errors of the ensemble classifiers. The fourth column indicates the proportions of the (approximate) population type I errors exceeding α, i.e., the violation rates. The fifth column indicates the (approximate) averages (avg) and standard deviations (sd) of the population type II errors of the ensemble classifiers.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.78% (.%) 2.3% 33.42% (5.9%) % (.8%).% 33.38% (4.2%) LR 9 2.7% (.75%).5% 33.35% (3.89%) 2.7% (.74%).5% 33.37% (3.86%) 5 2.7% (.73%).5% 33.35% (3.8%) 2.82% (.6%) 3.2% 9.22% (4.62%) % (.82%).7% 9.4% (3.63%) LDA % (.78%).4% 9.3% (3.39%) 2.72% (.77%).4% 9.9% (3.33%) % (.75%).4% 9.% (3.27%)

table S2. Results of SVMs in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.8% (.4%) 3.% 43.22% (9.4%) 5 2.7% (.77%).9% 42.85% (7.53%) LR % (.73%).5% 42.76% (7.25%) 2.68% (.72%).7% 42.8% (7.2%) % (.7%).5% 42.74% (7.5%) 2.8% (.8%) 3.4% 34.% (5.42%) % (.78%).9% 33.25% (2.95%) LDA % (.76%).6% 33.5% (2.75%) 2.72% (.75%).6% 33.% (2.73%) 5 2.7% (.74%).5% 33.5% (2.58%)

table S3. Results of RFs in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.79% (.3%) 2.7% 45.8% (7.52%) % (.72%).2% 45.5% (5.52%) LR % (.66%).% 44.97% (5.%) 2.3% (.65%).% 44.96% (5.4%) % (.64%).% 44.93% (5.%) 2.8% (.8%) 3.3% 26.% (7.6%) 5 2.6% (.75%).2% 25.8% (4.92%) LDA % (.73%).4% 24.84% (4.6%) 2.57% (.72%).4% 24.8% (4.52%) % (.7%).3% 24.77% (4.42%)

table S4. Results of NB in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.79% (.%) 2.% 35.3% (5.32%) % (.76%).5% 35.8% (4.%) LR % (.72%).4% 35.3% (3.87%) 2.67% (.7%).3% 35.5% (3.84%) % (.7%).2% 35.% (3.75%) 2.8% (.5%) 3.5% 9.2% (4.58%) % (.82%).8% 9.% (3.62%) LDA % (.77%).4% 9.6% (3.35%) 2.72% (.75%).4% 9.3% (3.28%) 5 2.7% (.75%).4% 9.5% (3.24%)

table S5. Results of LDA in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.78% (.2%) 3.% 33.48% (5.9%) % (.8%).4% 33.43% (4.4%) LR 9 2.7% (.75%).8% 33.42% (3.9%) 2.7% (.74%).5% 33.44% (3.85%) 5 2.7% (.73%).3% 33.43% (3.82%) 2.8% (.6%) 3.6% 9.5% (4.59%) % (.82%).7% 9.% (3.64%) LDA % (.77%).4% 9.5% (3.36%) 2.72% (.76%).3% 9.4% (3.29%) 5 2.7% (.75%).4% 9.5% (3.25%)

table S6. Results of AdaBoost in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.8% (.2%) 2.9% 4.5% (6.2%) % (.7%).% 4.87% (4.5%) LR % (.65%).% 4.74% (4.9%) 2.3% (.64%).% 4.77% (4.4%) % (.62%).% 4.72% (4.6%) 2.8% (.9%) 3.6% 24.39% (6.2%) % (.74%).8% 23.39% (4.2%) LDA 9 2.5% (.7%).5% 23.2% (3.93%) 2.5% (.7%).3% 23.7% (3.86%) % (.68%).2% 23.% (3.78%)

table S7. Description of variables used in real data application 1. The data and description are from the Early Warning Project.

Variable | Description | Values

Response
mkl.start.1 | Onset of state-led mass killing episode in next year (t + 1) | {0, 1}

Predictors
reg.afr | US Dept State region: Sub-Saharan Africa | {0, 1}
reg.eap | US Dept State region: East Asia and Pacific | {0, 1}
reg.eur | US Dept State region: Europe and Eurasia | {0, 1}
reg.mna | US Dept State region: Middle East and North Africa | {0, 1}
reg.sca | US Dept State region: South and Central Asia | {0, 1}
reg.amr | US Dept State region: Americas | {0, 1}
mkl.ongoing | Any ongoing episodes of state-led mass killing | {0, 1}
mkl.ever | Any state-led mass killing since WWII (cumulative) | {0, 1}
countryage.ln | Country age, logged | [ , ]
wdi.popsize.ln | Population size, logged | [4.7889, ]
imr.normed.ln | Infant mortality rate relative to annual global median, logged | [ , ]
gdppcgrow.sr | Annual % change in GDP per capita, meld of IMF and WDI, square root | [ , ]
wdi.trade.ln | Trade openness, logged | [ , ]
ios.iccpr | ICCPR 1st Optional Protocol signatory | {0, 1}
postcw | Post-Cold War period (year ≥ 1991) | {0, 1}
pol.cat.fl1 | Autocracy (Fearon and Laitin) | {0, 1}
pol.cat.fl2 | Anocracy (Fearon and Laitin) | {0, 1}
pol.cat.fl3 | Democracy (Fearon and Laitin) | {0, 1}
pol.cat.fl7 | Other (Fearon and Laitin) | {0, 1}
pol.durable.ln | Regime duration, logged (Polity) | [ , ]
dis.l4pop.ln | Percent of population subjected to state-led discrimination, logged | [ , ]
elf.ethnicc1 | Ethnic fractionalization: low | {0, 1}
elf.ethnicc2 | Ethnic fractionalization: medium | {0, 1}
elf.ethnicc3 | Ethnic fractionalization: high | {0, 1}
elf.ethnicc9 | Ethnic fractionalization: missing | {0, 1}
elc.eleth1 | Salient elite ethnicity: majority rule | {0, 1}
elc.eleth2 | Salient elite ethnicity: minority rule | {0, 1}
elc.eliti | Ruling elites espouse an exclusionary ideology | {0, 1}
cou.tries5d | Any coup attempts in past 5 years ((t − 4) to (t)) | {0, 1}
pit.sftpuhvl2..ln | Sum of max annual magnitudes of PITF instability other than genocide from past 10 yrs ((t − 9) to (t)), logged | [ , 4.586]
mev.regac.ln | Scalar measure of armed conflict in geographic region, logged | [ , ]
mev.civtot.ln | Scale of violent civil conflict, logged | [ , ]

table S8. The performance of the NP umbrella algorithm in real data application 2. Given α = 0.1 and δ = 0.1, after randomly splitting the data into training data (3/4 of the observations) and test data (1/4 of the observations) 100 times, we calculate the empirical type I and type II errors on the test data. The violation rates of empirical type I errors (the percentage of empirical type I errors greater than α) of different approaches are summarized in the 2nd column. The averages and standard errors of empirical type II errors in percentages are summarized in the 3rd column. Three classification methods are considered here: penalized logistic regression (penlr), random forests (RF), and naïve Bayes (NB). For each method, we considered three ways to choose the cutoff: default, the classical approach that minimizes the overall classification error; naïve, the naïve approach that chooses the threshold whose empirical type I error on the training data is no greater than and closest to α; and NP, the NP umbrella approach.

Method | Type I error avg (se) | Type I error violation rate | Type II error avg (se)
penlr (default) | 9.35% (.4%) | 42% | 3.75% (.7%)
penlr (naïve) | 2.53% (.22%) | 94% | .25% (.4%)
penlr (NP) | 4.48% (.%) | 7% | 6.47% (.9%)
RF (default) | 2.45% (.6%) | 66% | 5.72% (.8%)
RF (naïve) | 53.5% (.24%) | % | .34% (.2%)
RF (NP) | 5.9% (.2%) | % | 2.88% (.2%)
NB (default) | .6% (.4%) | 54% | 4.43% (.%)
NB (naïve) | .9% (.7%) | 57% | 8.32% (.62%)
NB (NP) | .87% (.%) | 4% | 76.27% (.6%)

table S9. Input information of the nproc package (version 2.0.9).

Argument | Description
x | an n × d design matrix with n observations and d covariates
y | an n-dimensional vector with binary responses
method | the base classification method. The following options are provided in nproc package version 2.0.9:
  logistic: logistic regression
  penlog: penalized logistic regression with LASSO penalty, depending on the glmnet package version 2.0.5
  svm: support vector machines, depending on the e1071 package version 1.6.7. Arguments of the svm function can be passed from the npc and nproc functions
  randomforest: random forests, depending on the randomForest package. Arguments of the randomforest function can be passed from the npc and nproc functions
  lda: linear discriminant analysis
  nb: naïve Bayes, depending on the e1071 package version 1.6.7
  ada: AdaBoost, depending on the ada package version 2.0-5
alpha | type I error upper bound, with default value 0.05
delta | the violation rate of type I error, with default value 0.05
split | the number of splits M, with default value 1
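For reference, a usage sketch based only on the arguments listed in table S9; the data x and y below are simulated placeholders, and exact function signatures may differ across versions of the nproc package:

```r
# install.packages("nproc")   # CRAN package described in table S9
library(nproc)

set.seed(1)
x <- matrix(rnorm(1000 * 3), ncol = 3)                       # placeholder design matrix
y <- rbinom(1000, 1, 1 / (1 + exp(-x %*% c(3, 2.4, 0.8))))   # placeholder binary response

# NP classifier with type I error bound alpha and tolerance delta (defaults in table S9)
fit <- npc(x = x, y = y, method = "logistic", alpha = 0.05, delta = 0.05, split = 1)

# NP-ROC band for the same base method
band <- nproc(x = x, y = y, method = "logistic", delta = 0.05)
```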
