Terminology for Statistical Data

Terminology for Statistical Data. Variables (also called features or attributes) and observations or cases (each consisting of values of multiple variables). In a standard data matrix, variables or features correspond to columns; observations or cases correspond to rows. We think of variables or features as being either factors (inputs, independent variables) or responses (outputs, dependent variables). Different areas of application tend to use different names.

Notation for Statistical Data. X denotes the dataset (a matrix) or an input variable; Y denotes an output variable; G denotes a variable indicating which group an observation is in. Often we distinguish random variables from realizations (or constants): X is a random variable; x is a realization. I will generally try to use the same notation as HTF.

Notation for Vectors. I do not use a special notation for vectors. I usually use lower case for vectors and upper case for matrices. Vectors are always column vectors, though I write a vector on a single line: x = (3, 2, 6, 1, 8). Transpose is indicated by a superscript T; for example, x^T x is the dot (inner) product (a quadratic form), and x^T A x, where A is a symmetric matrix, is a general quadratic form.
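As a small illustration of this notation in R (the vector is the one above; the matrix A here is just a made-up symmetric example):
# Inner product x^T x and quadratic form x^T A x
x <- c(3, 2, 6, 1, 8)
A <- diag(5) + 0.1        # an arbitrary symmetric 5 x 5 matrix, for illustration only
t(x) %*% x                # x^T x, the dot (inner) product
crossprod(x)              # the same value, computed directly
t(x) %*% A %*% x          # x^T A x, a general quadratic form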

Types of Data. Numeric or nominal. Continuous or discrete (or categorical). Nominal, ordinal, interval, ratio.

Supervised Learning; Classification. Given a dataset with some input features (factors, independent variables, etc.) and some output features (responses, groups or classes, dependent variables, etc.), see if we can figure out a rule that uses the attributes of a given observation to tell what group the observation is likely to be in. To do this, we may use part of the dataset as a training set and then see how good our rule is by applying it to the remainder as a test set. You can see how we might do this by using various subsets of the dataset as training or test sets.
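As a rough sketch of that idea in R (dat is a placeholder name for a generic data frame, not an object defined in these notes):
# Split a dataset into a training set and a test set
set.seed(1)
n <- nrow(dat)
train_index <- sample(n, size = round(0.7 * n))   # e.g., 70% of the cases for training
train <- dat[train_index, ]
test  <- dat[-train_index, ]
# Fit the classification rule on 'train', then assess it by applying it to 'test'.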

Based on the observed values of x_1 and x_2, can we tell whether a point should be red or green? [Scatter plot: x_1 on the horizontal axis, x_2 on the vertical axis; red points are Group 0, green points are Group 1.]

Go to R.

Linear Models. If we have input variables X_1, ..., X_p and an output variable Y, a linear model that relates them is Y ≈ β_0 + Σ_{j=1}^p X_j β_j, or in vector notation Y ≈ X^T β, where we've put a constant 1 in the first position of the vector X. In the context of machine learning, the constant β_0 is sometimes called the bias.

Fitting Linear Models Using Estimates. In statistical applications, we might assume such a model exists, and use data to estimate β. We'll let β̂ be the estimate of β. Given a set of input variables, we'll predict the output as Ŷ = X^T β̂. We often change the notation a little: let y_i be the response in the i-th observation and let x_i be the vector of inputs in the i-th observation, so y_i ≈ x_i^T β.

Fitting Linear Models by Global Least Squares. Given y_i ≈ x_i^T β, we often estimate β by least squares: find β̂ so that Σ_{i=1}^n (y_i − x_i^T β)^2 is minimized. This is an L_2 fit. Of course we could also estimate β by other criteria applied to the residuals, such as least absolute values, where Σ_{i=1}^n |y_i − x_i^T β| is minimized. This is an L_1 fit.

Alternate Notation for a Dataset in a Linear Model. y is the n-vector of observations on the output variable, and X is the n × (p+1) matrix of corresponding observations on the input variables, in which the i-th row corresponds to the vector of input variables (plus the constant 1): y ≈ Xβ. The least squares criterion to minimize is (y − Xβ)^T (y − Xβ), which yields the normal equations X^T X β̂ = X^T y.
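A minimal R sketch of the normal equations on a small simulated dataset (not the dataset used below), checked against lm():
set.seed(123)
n <- 50; p <- 2
X <- cbind(1, matrix(rnorm(n*p), ncol=p))   # n x (p+1), constant 1 in the first column
beta <- c(1, 2, -1)
y <- X %*% beta + rnorm(n)
# Solve X^T X beta-hat = X^T y
betahat <- solve(crossprod(X), crossprod(X, y))
betahat
coef(lm(y ~ X[,2] + X[,3]))                 # same estimates from lm()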

Now, let's return to our original problem. [The scatter plot of the two groups is shown again: x_1 versus x_2, Group 0 in red, Group 1 in green.]

This was generated in a manner similar to the way the data in Figure 2.1 on page 13 was generated. (See description on page 12.)
set.seed(555)
p <- 2
nm <- 10
# ten component means for each group
m1 <- matrix(rnorm(p*nm), ncol=p)
m1[,1] <- m1[,1] + 1
m2 <- matrix(rnorm(p*nm), ncol=p)
m2[,2] <- m2[,2] + 1
n1 <- 100
n2 <- 100
# draw each observation from a randomly chosen component
m1index <- sample(nm, n1, replace=TRUE)
X1 <- matrix(rnorm(p*n1), ncol=p) + m1[m1index,]
m2index <- sample(nm, n2, replace=TRUE)
X2 <- matrix(rnorm(p*n2), ncol=p) + m2[m2index,]
plot(X1[,1], X1[,2], col=2,
     xlab=expression(italic(x)[1]), ylab=expression(italic(x)[2]))
points(X2[,1], X2[,2], col=3)
legend("topright", legend=c("Group 0","Group 1"), pch=c(1,1), col=c(2,3))

Would a linear model form a useful discriminator between the groups? Let's first put the data together in a single dataset with a group variable G, with G = 0 if the observation is in the first group and G = 1 if it is in the second group, and fit a linear regression with G as the dependent variable. (Often (0,1) is a more convenient indicator pair than (−1,1), because we can relate it easily to a Bernoulli probability.) Fit y = β_0 + β_1 x_1 + β_2 x_2 and take Ĝ = 1 if ŷ > 0.5. If β̂_2 ≠ 0, the intersection of this fitted plane with the y = 0.5 plane is the line x_2 = (0.5 − β̂_0)/β̂_2 − (β̂_1/β̂_2) x_1, so we can draw it on our plot, which is a projection of the 3-space onto the y = 0.5 plane. (It is not the x_1-x_2 plane itself.)

ex1 <- data.frame(x1=c(X1[,1],X2[,1]),
                  x2=c(X1[,2],X2[,2]),
                  G=c(rep(0,n1),rep(1,n2)))
attach(ex1)
fit2 <- lm(G ~ x1 + x2)
plot(x1, x2, col=G+2,
     xlab=expression(italic(x)[1]), ylab=expression(italic(x)[2]))
b0 <- fit2$coef[1]; b1 <- fit2$coef[2]; b2 <- fit2$coef[3]
if (abs(b2) > .Machine$double.eps) {
  abline((-b0+0.5)/b2, -b1/b2)
  # color a grid of points by the predicted group
  npts <- 50
  v1 <- min(x1) + (1:npts)*((max(x1)-min(x1))/npts)
  v2 <- min(x2) + (1:npts)*((max(x2)-min(x2))/npts)
  for (i in 1:npts)
    points(v1, rep(v2[i],npts), pch=".",
           col=(v2[i] > (-b0+0.5)/b2 - (b1/b2)*v1)+2)
}

Ĝ = 1 (green) above the line and Ĝ = 0 (red) below the line. [The scatter plot is shown again with the fitted decision line and the grid of points colored by predicted group.]

The linear classifier does not do a very good job. Just eyeballing the problem, we can see that no single line could separate the points. Is the number of red points above the line about the same as the number of green points below it? This is equivalent to asking whether the number of positive residuals is almost the same as the number of negative residuals. In least squares fitting, this is not necessarily the case, especially in the (more usual) case in which the response is a continuous variable.

Even in the case of a dichotomous response variable, the number above may not be the same as the number below. In this example, however, sum(fit2$residuals>0) is 101 (essentially half), so it is not an issue in this case. We can ensure the number above is almost the same as the number below by doing an L_1 fit:
library(quantreg)
fit1 <- rq(G ~ x1 + x2)
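Continuing that idea, a short sketch (assuming fit1 and the plot produced by the code above) that draws the resulting L_1 decision line and checks the residual signs:
# Decision line from the L_1 (median regression) fit, drawn dashed
c0 <- coef(fit1)[1]; c1 <- coef(fit1)[2]; c2 <- coef(fit1)[3]
if (abs(c2) > .Machine$double.eps)
  abline((-c0 + 0.5)/c2, -c1/c2, lty=2)
sum(fit1$residuals > 0)   # roughly half of the observations, as intended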

Unknown or Unused Features. Often an additional input variable could allow a much better classification. As scientists, that should be the first thing we think about. The more variables we have, the more degrees of freedom for fitting we have. In statistical data analysis, we speak of the degrees of freedom for fitting, or the model degrees of freedom, and the leftover residual degrees of freedom, or the degrees of freedom for error.

Unknown or Unused Features. At the end of the day, however, the data scientist must accept what is given, and extract the most useful knowledge from that. So let's continue investigating the possibilities given only the inputs x_1 and x_2.

Nonlinear Global Classifiers. Could a polynomial model, y ≈ β_00 + β_10 x_1 + ... + β_p0 x_1^p + β_01 x_2 + ... + β_0p x_2^p + β_11 x_1 x_2 + ... + β_pp x_1^p x_2^p, work better? Possibly. (A polynomial model is still a linear model.) Could a generalized linear model, E(Y) ≈ e^{β^T x}, work better? Possibly (a logistic model, e.g.). Could a general nonlinear model, y ≈ f(β, x), work better? Possibly. (What form for the function f?) The model complexity can increase the degrees of freedom for fitting that we have.
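As one hedged illustration of the first two possibilities (assuming the ex1 data frame and the scatter plot from the earlier code), a logistic model with quadratic and interaction terms can be fit with glm(), and its 0.5 decision boundary added with contour():
# Quadratic logistic model; note it is still linear in the parameters
fitg <- glm(G ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1*x2),
            family = binomial, data = ex1)
g1 <- seq(min(ex1$x1), max(ex1$x1), length.out = 100)
g2 <- seq(min(ex1$x2), max(ex1$x2), length.out = 100)
grd <- expand.grid(x1 = g1, x2 = g2)
phat <- predict(fitg, newdata = grd, type = "response")
contour(g1, g2, matrix(phat, nrow = 100), levels = 0.5,
        add = TRUE, drawlabels = FALSE)   # estimated boundary where Pr(G=1|x) = 0.5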

Local Fitting: Use of Nearest Neighbors. Use only the nearest k neighbors to predict. The first question is how many nearest neighbors? (See Figures 2.2 and 2.3 in the text.) Note that the more nearest neighbors we use, the smoother the fit becomes. (The separating line is less smooth, however; note the jagged line in Figure 2.2.) Variations on the use of nearest neighbors include kernel methods, local weighting, and use of local parametric models.
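A minimal k-nearest-neighbor sketch, using the class package, the hypothetical ex1 data frame, and the prediction grid grd built in the previous sketch:
library(class)
k <- 15   # an arbitrary choice; how many neighbors to use is the question discussed above
Ghat <- knn(train = ex1[, c("x1","x2")], test = grd,
            cl = factor(ex1$G), k = k)
# Color the grid by the k-NN prediction, as the earlier code did for the linear rule
points(grd$x1, grd$x2, pch = ".",
       col = as.numeric(as.character(Ghat)) + 2)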

Degrees of Freedom. In nonparametric or semiparametric procedures, we speak of the effective degrees of freedom for fitting. The effective degrees of freedom for fitting may also depend on the sample size. In the case of simple nearest-neighbor fitting, the effective model degrees of freedom (often taken to be N/k for k-nearest neighbors with sample size N) increases as the number of nearest neighbors decreases.

Fitting Based on Local Probabilities. Statistical methods are developed in the context of a family of probability models. The family can be strong (very specific models) or weak (a wide range of models). If we have a model for the probabilities of each of the classes at each point in the space of inputs, there is a straightforward method of classification.

Statistical Decision Theory. Define a loss function, usually based on errors; the MSE is a common loss function. In classification problems, the 0-1 loss function is a logical choice. If we have a probability model, we choose a method that minimizes the expected value of the loss (the risk). In a classification problem, we also consider the prediction error and its expected value, the EPE.

A Probability Model for Classification. Suppose we have K groups, G = {G_1, ..., G_K}. (These are sometimes called targets in the classification problem.) Suppose we have a random variable G whose support is G. Suppose we have an observable random variable X with support X. Suppose for each k ∈ {1, ..., K} and x ∈ X, we have a model that gives Pr(G = G_k | X = x). That last supposition is a lot!! But let's proceed.

A Probability Model for Classification and the Bayes Classifier. Given x and the assumed prior probability distribution Pr(G = G_k | X = x), we want a Ĝ(x) that is optimal (in some way). Take a 0-1 loss: L(G, Ĝ(x)) = 0 if G = Ĝ(x), and 1 if G ≠ Ĝ(x). (This is just a little strange for a frequentist statistician, because G is a random variable. In a Bayesian context, however, it is the usual formulation.) We choose Ĝ(x) to minimize the risk (the expected value of the loss). The optimal choice under the 0-1 loss is Ĝ(x) = argmax_{G_k ∈ G} Pr(G = G_k | X = x). This is called the Bayes classifier.
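A toy R sketch of the Bayes classifier for a made-up univariate two-class problem in which the class-conditional densities and the prior class probabilities are completely known (exactly the strong supposition noted above):
# Class G_1: N(0,1); class G_2: N(2,1); Pr(G = G_2) = 0.3 (all assumed known)
prior2 <- 0.3
bayes_classify <- function(x) {
  p1 <- (1 - prior2) * dnorm(x, mean = 0, sd = 1)   # proportional to Pr(G = G_1 | X = x)
  p2 <- prior2       * dnorm(x, mean = 2, sd = 1)   # proportional to Pr(G = G_2 | X = x)
  ifelse(p2 > p1, 2, 1)                             # argmax of the posterior probabilities
}
bayes_classify(c(-1, 0.5, 1.5, 3))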

Higher Dimensions For higher dimensions, we use projections. Strange things can happen in higher dimensions. Everything becomes an outlier.

Statistical Models Various statistical models. Various methods of fitting a model.

Restricted and Regularized Estimators. Use local weighting (kernel regression). Or add a penalty to the criterion, e.g., regularized least squares: minimize (y − Xβ)^T (y − Xβ) + λ f(β), where λ is a tuning parameter. PRESS (the prediction sum of squares) is one criterion for choosing it.
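A minimal sketch of the penalized criterion with f(β) = β^T β (ridge regression) and a fixed λ, on made-up data; in practice λ would be chosen by a criterion such as PRESS or by cross-validation:
set.seed(42)
n <- 50; p <- 5
X <- matrix(rnorm(n*p), ncol = p)
y <- X %*% c(2, 0, 0, -1, 0) + rnorm(n)
lambda <- 1
# Minimizer of (y - X beta)^T (y - X beta) + lambda * beta^T beta
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
beta_ridge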

Model Selection May introduce bias and/or variance. Use of cross-validation.
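A rough sketch of K-fold cross-validation in R (made-up data and a plain squared-error criterion; just one of many ways to set this up):
set.seed(7)
n <- 100
x <- runif(n); y <- 1 + 2*x + rnorm(n, sd = 0.5)
K <- 5
fold <- sample(rep(1:K, length.out = n))     # random fold assignment
cv_err <- numeric(K)
for (k in 1:K) {
  train <- data.frame(x = x[fold != k], y = y[fold != k])
  test  <- data.frame(x = x[fold == k], y = y[fold == k])
  fit   <- lm(y ~ x, data = train)
  cv_err[k] <- mean((test$y - predict(fit, newdata = test))^2)
}
mean(cv_err)    # cross-validation estimate of the prediction MSE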

Variance/Bias Tradeoff MSE = variance + bias-squared. More smoothing yields more bias. Less smoothing yields more variance.

Studying Statistical Methods by Simulation Monte Carlo simulation methods are used to compare methods.
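For example, a small Monte Carlo sketch (entirely made-up settings) comparing the average test mean squared error of a correctly specified linear fit with that of a more flexible quadratic fit over repeated simulated datasets:
set.seed(99)
nrep <- 500; n <- 50
mse_lin <- mse_quad <- numeric(nrep)
for (r in 1:nrep) {
  x  <- runif(n);  y  <- 1 + 2*x  + rnorm(n, sd = 0.5)    # training data
  x0 <- runif(n);  y0 <- 1 + 2*x0 + rnorm(n, sd = 0.5)    # independent test data
  f1 <- lm(y ~ x)
  f2 <- lm(y ~ x + I(x^2))
  mse_lin[r]  <- mean((y0 - predict(f1, data.frame(x = x0)))^2)
  mse_quad[r] <- mean((y0 - predict(f2, data.frame(x = x0)))^2)
}
c(linear = mean(mse_lin), quadratic = mean(mse_quad))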