# Machine Learning Linear Classification. Prof. Matteo Matteucci


1 Machine Learning Linear Classification Prof. Matteo Matteucci

2 Recall from the first lecture 2 Regression: $X \in \mathbb{R}^p$, $Y \in \mathbb{R}$ (continuous output). Classification: $X \in \mathbb{R}^p$, $Y \in \{\Omega_0, \Omega_1, \dots, \Omega_K\}$ (discrete output). Clustering: $X \in \mathbb{R}^p$, $Y(X)$ partitions.

3 Example: Default dataset 3 Overall Default Rate 3%

4 Linear regression for classification? 4 o Suppose we want to predict the medical condition of a patient: how should the output be encoded? We could use a dummy variable for a binary output, but how do we deal with multiple outputs? Different encodings could result in different models

5 Recall here the Bayesian Classifier 5 o For a classification problem we can use the error rate, i.e., $\text{Error Rate} = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$, where $I(y_i \neq \hat{y}_i)$ is an indicator function which gives 1 if the condition $y_i \neq \hat{y}_i$ holds and 0 otherwise. The error rate represents the fraction of incorrect classifications, or misclassifications o The Bayes Classifier minimizes the average test error rate by predicting the class $\arg\max_j P(Y = j \mid X = x_0)$ o The Bayes error rate refers to the lowest possible error rate achievable knowing the true distribution of the data. The best classifier possible estimates the class posterior probability!!
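A minimal sketch of the error-rate computation above; the labels below are made up purely for illustration:

```python
import numpy as np

# Hypothetical true labels and predictions.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Error rate = (1/n) * sum_i I(y_i != yhat_i): the fraction of misclassifications.
error_rate = np.mean(y_true != y_pred)
print(f"Error rate: {error_rate:.3f}")  # 2 mistakes out of 8 -> 0.250
```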

6 Logistic Regression 6 o We want to model the probability of the class given the input but a linear model has some drawbacks

7 Example: Default data & Linear Regression 7 Overall Default Rate 3% Negative probability?

8 Logistic Regression 8 o We want to model the probability of the class given the input but a linear model has some drawbacks o Logistic regression solves the negative probability issue (and other issues as well) by regressing the logistic function

9 Example: Default data & Logistic Regression 9

10 Logistic Regression 10 o We want to model the probability of the class given the input but a linear model has some drawbacks (see later slide): Linear Regression $p(X) = \beta_0 + \beta_1 X$ o Logistic regression solves the negative probability issue (and other issues as well) by regressing the logistic function: Logistic Regression $p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$ From this we derive $\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}$, which is called the odds, and taking logarithms $\log \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X$, which is called the log-odds or logit
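A small sketch of the logistic function and the odds/logit identities above; the coefficient values are hypothetical:

```python
import numpy as np

def logistic(x, beta0, beta1):
    # p(X) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)), always in (0, 1):
    # no negative probabilities, unlike the linear model.
    z = beta0 + beta1 * x
    return np.exp(z) / (1.0 + np.exp(z))

beta0, beta1 = -4.0, 0.002  # hypothetical coefficients
x = 1500.0
p = logistic(x, beta0, beta1)
odds = p / (1.0 - p)        # equals exp(beta0 + beta1 * x)
logit = np.log(odds)        # equals beta0 + beta1 * x, linear in x
print(p, odds, logit)       # logit = -4.0 + 0.002 * 1500 = -1.0
```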

11 Coefficient interpretation 11 o Interpreting what $\beta_1$ means is not very easy with logistic regression, simply because we are predicting $P(Y)$ and not $Y$. If $\beta_1 = 0$, there is no relationship between $Y$ and $X$. If $\beta_1 > 0$, when $X$ gets larger so does the probability that $Y = 1$. If $\beta_1 < 0$, when $X$ gets larger the probability that $Y = 1$ gets smaller. o But how much bigger or smaller depends on where we are on the slope, i.e., the relationship is not linear

12 Training Logistic Regression (1/4) 12 o For the basic logistic regression we need two parameters o In principle we could use (non-linear) Least Squares, fitting the corresponding model on the observed data o But a more principled approach for training in classification problems is based on Maximum Likelihood: we want to find the parameters which maximize the likelihood function

13 Maximum Likelihood flash-back (1/6) 13 o Suppose we observe some i.i.d. samples coming from a Gaussian distribution with known variance: Which distribution do you prefer?

14 Maximum Likelihood flash-back (2/6) 14 o There is a simple recipe for Maximum Likelihood estimation o Let's try to apply it to our example

15 Maximum Likelihood flash-back (3/6) 15 o Let's try to apply it to our example 1. Write the likelihood for the data

16 Maximum Likelihood flash-back (4/6) 16 o Let's try to apply it to our example 2. Take the logarithm of the likelihood -> log-likelihood

17 Maximum Likelihood flash-back (5/6) 17 o Let's try to apply it to our example 3. Work out the derivatives using high-school calculus

18 Maximum Likelihood flash-back (6/6) 18 o Let's try to apply it to our example 4. Solve the unconstrained equations
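Applying the four-step recipe to the Gaussian example with known variance yields the sample mean as the maximum likelihood estimate; a small numerical sketch (the data here are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0  # known standard deviation
samples = rng.normal(loc=5.0, scale=sigma, size=1000)  # i.i.d. Gaussian data

# Step 4: setting the derivative of the log-likelihood w.r.t. mu to zero
# gives the closed-form solution mu_hat = (1/n) * sum_i x_i.
mu_hat = samples.mean()

# Sanity check: the log-likelihood (up to constants) peaks near mu_hat.
def log_lik(mu):
    return -0.5 * np.sum((samples - mu) ** 2) / sigma**2

grid = np.linspace(4.0, 6.0, 201)
best = grid[np.argmax([log_lik(m) for m in grid])]
print(mu_hat, best)  # both close to the true mean 5.0
```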

19 Training Logistic Regression (1/4) 19 o For the basic logistic regression we need two parameters o In principle we could use (non-linear) Least Squares, fitting the corresponding model on the observed data o But a more principled approach for training in classification problems is based on Maximum Likelihood: we want to find the parameters which maximize the likelihood function

20 Training Logistic Regression (2/4) 20 o Let's find the parameters which maximize the likelihood function $\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i: y_i = 0} (1 - p(x_i))$ o If we compute the log-likelihood for N observations (taken from ESL) we obtain $\ell(\beta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right] = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right]$ Can you derive it?
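A sketch of maximizing this log-likelihood numerically on synthetic data by plain gradient ascent; real implementations typically use Newton-Raphson/IRLS, and all data and step-size values here are illustrative:

```python
import numpy as np

def log_likelihood(beta, X, y):
    # ESL form: sum_i [ y_i * beta^T x_i - log(1 + exp(beta^T x_i)) ].
    z = X @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))

def fit_logistic(X, y, lr=0.5, n_iter=5000):
    # Gradient of the log-likelihood: sum_i (y_i - p_i) x_i.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))  # current P(Y=1 | x_i)
        beta += lr * X.T @ (y - p) / len(y)    # ascend the likelihood surface
    return beta

rng = np.random.default_rng(1)
x = rng.normal(size=200)
true_p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.random(200) < true_p).astype(float)
X = np.column_stack([np.ones_like(x), x])        # intercept + predictor
beta_hat = fit_logistic(X, y)
print(beta_hat, log_likelihood(beta_hat, X, y))  # roughly recovers (0.5, 2.0)
```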

21 Training Logistic Regression (3/4) 21 o Let's find the parameters which maximize the likelihood function. The z-statistic has the same role as the regression t-statistic: a large value means the parameter is not null. The intercept does not have a particular meaning; it is used to adjust the probability to the class proportions

22 Example: Default data & Logistic Regression 22

23 Training Logistic Regression (4/4) 23 o Let's find the parameters which maximize the likelihood function. We can train the model with qualitative variables through the use of binary (dummy) variables

24 Making predictions with Logistic Regression 24 o Once we have the model parameters we can predict the class o The default probability with a \$1,000 balance is <1%, while with a \$2,000 balance it becomes 58.6% o With qualitative variables, i.e., dummy variables, we get that being a student results in a higher default probability
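A sketch of these predictions; the coefficient values below are the ones reported in ISLR for the balance-only logistic fit on the Default data ($\hat\beta_0 \approx -10.6513$, $\hat\beta_1 \approx 0.0055$):

```python
import numpy as np

# Coefficients as reported in ISLR for Default ~ balance.
beta0, beta1 = -10.6513, 0.0055

def p_default(balance):
    z = beta0 + beta1 * balance
    return 1.0 / (1.0 + np.exp(-z))

print(f"P(default | balance=1000) = {p_default(1000):.4f}")  # ~0.006, i.e. <1%
print(f"P(default | balance=2000) = {p_default(2000):.4f}")  # ~0.586, i.e. 58.6%
```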

25 Multiple Logistic Regression 25 o So far we have considered only one predictor, but we can extend the approach to multiple regressors: $\log \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$ o By maximum likelihood we learn the corresponding parameters. What about this?

26 An apparent contradiction 26 The student coefficient is positive in the single-predictor model but negative in the multiple regression!!!

27 Example: Confounding in Default data set 27

28 Example: South African Heart Disease 28 Taken from ESL

29 Logistic Regression for Feature Selection 29 o If we fit the complete model on these data we get Taken from ESL o While if we use stepwise Logistic Regression Taken from ESL

30 Logistic Regression parameters meaning 30 o Regression parameters represent the increment in the logit of the probability given by a unit increment of a variable o Consider an increase in lifetime tobacco consumption of 1 kg: this accounts for an increase in log-odds of 0.081; exponentiating, the odds are multiplied by exp(0.081) = 1.084, which means an overall increase of 8.4% in the odds o With a 95% confidence interval: exp(0.081 ± 2 × 0.026) = (1.03, 1.14). Taken from ESL

31 Regularized Logistic Regression 31 o As for Linear Regression we can compute a Lasso version

32 Regularized Logistic Regression 32 o As for Linear Regression we can compute a Lasso version Taken from ESL

33 Multiclass Logistic Regression 33 o Logistic Regression extends naturally to multiclass problems by computing the log-odds w.r.t. the K-th class: $\log \frac{P(G = k \mid X = x)}{P(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x$, for $k = 1, \dots, K-1$ (comes from ESL, but it's worth knowing!!! The notation is different because it comes from ESL) o This is equivalent to $P(G = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_k^T x}}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_l^T x}}$ for $k = 1, \dots, K-1$, with $P(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_l^T x}}$ o Can you prove it?!?!?
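A sketch of turning the K-1 linear log-odds above into class posteriors; the parameter values are hypothetical:

```python
import numpy as np

def multiclass_posteriors(X, B):
    # B has shape (K-1, p+1): row k holds the log-odds coefficients of class k
    # vs. the reference class K. X includes an intercept column.
    Z = X @ B.T                                    # (n, K-1) log-odds
    expZ = np.exp(Z)
    denom = 1.0 + expZ.sum(axis=1, keepdims=True)  # 1 + sum_l exp(z_l)
    return np.hstack([expZ / denom, 1.0 / denom])  # last column = class K

# Hypothetical parameters for K=3 classes and one predictor.
B = np.array([[0.2, 1.0],      # class 1 vs class 3
              [-0.5, -1.5]])   # class 2 vs class 3
X = np.array([[1.0, 0.3],
              [1.0, -2.0]])    # intercept + predictor value
probs = multiclass_posteriors(X, B)
print(probs, probs.sum(axis=1))  # each row of posteriors sums to 1
```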

34 Wrap-up on Logistic Regression 34 o We model the log-odds as a linear regression model o This means the posterior probability becomes the logistic function of the linear predictor o Parameters represent the log-odds increase per unit increment of a variable, keeping the others fixed o We can use it to perform feature selection using z-scores and forward stepwise selection o The class decision boundary is linear, but points close to the boundary count more; this will be discussed later

35 Beyond Logistic Regression 35 o Logistic Regression models the class posterior probability directly o Linear Discriminant Analysis uses the Bayes Theorem instead o What improvements come with this model? Parameter learning is unstable in Logistic Regression when classes are well separated. With little data and normally distributed predictors LDA is more stable. LDA is also a very popular algorithm with more than 2 response classes

36 Linear Discriminant Analysis (1/3) 36 o Suppose we want to discriminate among K>2 classes o Each class has a prior probability $\pi_k$ o Given the class, we model the density function of the predictors as $f_k(x) = P(X = x \mid Y = k)$ o Using the Bayes Theorem we obtain $P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$ The prior probability is relatively simple to learn; the likelihood might be more tricky and we need some assumptions to simplify it o If we correctly estimate the likelihood we obtain the Bayes Classifier, i.e., the one with the smallest error rate!

37 Linear Discriminant Analysis (2/3) 37 o Let's assume p=1 and use a Gaussian distribution $f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$ o Let's assume all classes have the same variance $\sigma^2$ o The posterior probability as computed by LDA becomes $P(Y = k \mid X = x) = \frac{\pi_k \exp\left(-\frac{(x - \mu_k)^2}{2\sigma^2}\right)}{\sum_{l=1}^{K} \pi_l \exp\left(-\frac{(x - \mu_l)^2}{2\sigma^2}\right)}$ o The selected class is the one with the highest posterior, which is equivalent to the highest discriminant function $\delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k$ (can you derive this?), a linear discriminant function in x
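A sketch of the one-dimensional discriminant function above, with hypothetical class parameters:

```python
import numpy as np

def lda_discriminant(x, mu_k, sigma2, pi_k):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mu_k / sigma2 - mu_k**2 / (2.0 * sigma2) + np.log(pi_k)

# Two classes with shared variance and equal priors (hypothetical values).
mus, pis, sigma2 = [0.0, 2.0], [0.5, 0.5], 1.0
x = 1.3
deltas = [lda_discriminant(x, mu, sigma2, pi) for mu, pi in zip(mus, pis)]
print(np.argmax(deltas))  # class 1, since x > (mu_0 + mu_1) / 2 = 1
```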

38 Linear Discriminant Analysis (3/3) 38 o With 2 classes having the same prior probability we decide class 1 when $\delta_1(x) > \delta_2(x)$, i.e., when $x > \frac{\mu_1 + \mu_2}{2}$ o The Bayes decision boundary corresponds to $x = \frac{\mu_1 + \mu_2}{2}$ o Training is as simple as estimating the model parameters: $\hat\mu_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i$, $\hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat\mu_k)^2$, $\hat\pi_k = \frac{n_k}{n}$
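A sketch of the plug-in parameter estimation for p=1, on simulated data:

```python
import numpy as np

def fit_lda_1d(x, y, K):
    # pi_k = n_k / n, mu_k = class sample mean,
    # sigma^2 = pooled (shared) variance estimate.
    n = len(x)
    pis = np.array([(y == k).mean() for k in range(K)])
    mus = np.array([x[y == k].mean() for k in range(K)])
    sigma2 = sum(((x[y == k] - mus[k]) ** 2).sum() for k in range(K)) / (n - K)
    return pis, mus, sigma2

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y = np.repeat([0, 1], 100)
print(fit_lda_1d(x, y, K=2))  # priors ~(0.5, 0.5), means ~(0, 2), variance ~1
```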

39 LDA Simple Example (with p=1) 39

40 Linear Discriminant Analysis with p>1 (1/3) 40 o In case p>1 we assume $X = (X_1, \dots, X_p)$ comes from a multivariate Gaussian $N(\mu, \Sigma)$

41 Linear Discriminant Analysis with p>1 (2/3) 41 o In the case of p>1 the LDA classifier assumes: observations from the k-th class are drawn from $N(\mu_k, \Sigma)$; the covariance structure $\Sigma$ is common to all classes o The Bayes discriminant function becomes $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$ (can you derive this?) Still linear in x!!! o From this we can compute the boundary between each pair of classes (considering the two classes having the same prior probability) o Training formulas for the LDA parameters are similar to the case of p=1
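A sketch of the multivariate discriminant, with hypothetical means and a shared covariance:

```python
import numpy as np

def lda_discriminant_p(x, mu_k, Sigma_inv, pi_k):
    # delta_k(x) = x^T Sigma^-1 mu_k - 0.5 mu_k^T Sigma^-1 mu_k + log pi_k
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])  # shared covariance (hypothetical)
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
pis = [0.5, 0.5]
x = np.array([1.5, 0.5])
deltas = [lda_discriminant_p(x, m, Sigma_inv, p) for m, p in zip(mus, pis)]
print(np.argmax(deltas))  # pick the class with the largest discriminant
```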

42 Linear Discriminant Analysis Example 42

43 Example: LDA on the Default Dataset 43 o LDA on the Default dataset gets a 2.75% training error rate. Having 10,000 records and p=3 we do not expect much overfitting (by the way, how many parameters do we have?). Since only 3.33% of the records are defaulters, a dummy classifier that always predicts "no default" would get a similar error rate

44 On the performance of LDA 44 o Errors in classification are often reported as a Confusion Matrix. Sensitivity: percentage of true defaulters correctly identified. Specificity: percentage of non-defaulters correctly identified o The Bayes classifier optimizes the overall error rate regardless of the class the errors come from, and it does this by thresholding the posterior probability at 0.5 o Can we improve on this?
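A sketch of computing sensitivity and specificity from the confusion-matrix counts; the labels below are made up to mimic an imbalanced problem:

```python
import numpy as np

def confusion_stats(y_true, y_pred):
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp / (tp + fn), tn / (tn + fp)

y_true = np.array([0] * 90 + [1] * 10)                     # 10% positives
y_pred = np.array([0] * 88 + [1] * 2 + [0] * 7 + [1] * 3)  # hypothetical output
sens, spec = confusion_stats(y_true, y_pred)
print(sens, spec)  # low sensitivity (0.3), high specificity (~0.98)
```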

45 Example: Increasing LDA Sensitivity 45 o We might want to improve classifier sensitivity with respect to a given class because we consider it more critical. Lowering the threshold reduced the error rate on defaulters from 75.7% to 41.4%, and increased the overall error to 3.73% (but it is worth it)

46 Tweaking LDA sensitivity 46 The right threshold choice comes from domain knowledge
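A sketch of the threshold tweak: lowering the threshold on the posterior flags more subjects as defaulters, trading overall error for sensitivity (the posterior values are simulated for illustration):

```python
import numpy as np

def classify(posterior_default, threshold=0.5):
    # Predict "default" whenever P(default | x) exceeds the threshold.
    return (posterior_default >= threshold).astype(int)

rng = np.random.default_rng(3)
posteriors = rng.beta(1, 15, size=1000)  # skewed toward small probabilities
for t in (0.5, 0.2, 0.1):
    flagged = classify(posteriors, t).mean()
    print(f"threshold={t}: fraction flagged as defaulters = {flagged:.3f}")
```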

47 ROC Curve 47 o The ROC (Receiver Operating Characteristic) curve summarizes false positive and false negative errors o It is obtained by testing all possible thresholds. The overall performance is given by the Area Under the ROC Curve (AUC): a classifier which randomly guesses (with two classes) has an AUC = 0.5, while a perfect classifier has AUC = 1 o The ROC curve plots the true positive rate against the false positive rate. Sensitivity is equivalent to the true positive rate; specificity is equivalent to 1 - the false positive rate
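A sketch of building the ROC curve by sweeping all thresholds and integrating it into the AUC; the scores and labels are simulated:

```python
import numpy as np

def roc_auc(scores, y):
    # Sort by decreasing score; each prefix corresponds to one threshold.
    order = np.argsort(-scores)
    y = y[order]
    tpr = np.cumsum(y) / y.sum()            # sensitivity at each threshold
    fpr = np.cumsum(1 - y) / (1 - y).sum()  # 1 - specificity at each threshold
    return np.sum(np.diff(fpr, prepend=0.0) * tpr)  # area under the step curve

rng = np.random.default_rng(4)
y = (rng.random(2000) < 0.3).astype(int)
informative = y + rng.normal(scale=1.0, size=2000)  # scores correlated with y
print(roc_auc(informative, y))       # well above 0.5
print(roc_auc(rng.random(2000), y))  # ~0.5, like random guessing
```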

48 ROC Curve for LDA on Default data 48 At one threshold extreme all subjects are classified as defaulters, at the other no subject is. The ROC Curve for Logistic Regression is ~ the same

49 Clearing out terminology 49 o When applying a classifier we can obtain true positives, true negatives, false positives, and false negatives o Out of these we can define the usual quantities, e.g., true/false positive rates, sensitivity, and specificity

50 Quadratic Discriminant Analysis 50 o Linear Discriminant Analysis assumes all classes have a common covariance structure o Quadratic Discriminant Analysis assumes different covariances $\Sigma_k$ o Under this hypothesis the Bayes discriminant function becomes $\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$ (can you derive this?), a quadratic function of x o The LDA vs. QDA decision boils down to the bias-variance trade-off: QDA requires estimating $K p(p+1)/2$ covariance parameters while LDA only $p(p+1)/2$
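A sketch of the quadratic discriminant, with hypothetical class-specific covariances:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    # delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^-1 (x-mu_k) + log pi_k
    # Quadratic in x, because Sigma_k differs across classes.
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma_k, diff) + np.log(pi_k)

# Two hypothetical classes with different covariances (p=2).
params = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),
    (np.array([1.0, 1.0]), np.array([[2.0, 0.8], [0.8, 2.0]]), 0.5),
]
x = np.array([0.5, 0.5])
print(np.argmax([qda_discriminant(x, m, S, p) for m, S, p in params]))
```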

51 QDA vs LDA Example 51

52 Which classifier is better? 52 o Let's consider 2 classes and 1 predictor. It can be seen that for LDA the log-odds is given by $\log \frac{p_1(x)}{1 - p_1(x)} = c_0 + c_1 x$, where $c_0$ and $c_1$ are functions of $\mu_1$, $\mu_2$, and $\sigma^2$, while for Logistic Regression the log-odds is $\beta_0 + \beta_1 x$. Both are linear functions, but the learning procedures are different o Linear Discriminant Analysis is the Optimal Bayes classifier if its hypotheses hold, otherwise Logistic Regression can outperform it! o Quadratic Discriminant Analysis is to be preferred if the class covariances are different and we have a non-linear boundary

53 Some scenarios tested 53 o Linear Boundary Scenarios: 1. Samples from 2 uncorrelated normal distributions 2. Samples from 2 slightly correlated normal distributions 3. Samples from Student's t-distributed classes

54 Some other scenarios tested 54 o Non-Linear Boundary Scenarios: 4. Samples from 2 normal distributions with different correlations 5. Samples from 2 normals, predictors are quadratic functions 6. As the previous one, but with a more complicated function

55 Overall conclusion on the comparison 55 o No method is better than all the others. When the decision boundary is linear, LDA and Logistic Regression are those performing better. When the decision boundary is moderately non-linear, QDA may give better results. For much more complex decision boundaries, non-parametric approaches such as KNN perform better, but the right level of smoothness has to be chosen
