Terminology for Statistical Data
variables - features - attributes
observations - cases (each consists of multiple values)
In a standard data matrix, variables or features correspond to columns, and observations or cases correspond to rows.
We think of variables or features as being either
factors, inputs, independent variables
or
responses, outputs, dependent variables.
Different areas of application tend to use different names.
Notation for Statistical Data
X - dataset (matrix), or X - input variable
Y - output variable
G - variable indicating which group an observation is in.
Often we distinguish random variables from realizations (or constants): X is a random variable; x is a realization.
I will generally try to use the same notation as HTF.
Notation for Vectors
I do not use a special notation for vectors. I usually use lower case for vectors and upper case for matrices.
Vectors are always column vectors, though I write a vector in a single line: x = (3,2,6,1,8).
Transpose is indicated by a superscript T; for example, $x^T x$ is the dot (inner) product (or quadratic form). $x^T A x$, where A is a symmetric matrix, is a general quadratic form.
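A minimal R sketch of these conventions, using the vector from the text. The matrix A here is an arbitrary symmetric matrix chosen only for illustration; it does not come from the slides.

```r
# The vector from the text. R stores it without orientation,
# but matrix products treat it as a column vector.
x <- c(3, 2, 6, 1, 8)

# Inner product x^T x (crossprod returns a 1 x 1 matrix; drop() makes it a scalar)
xtx <- drop(crossprod(x))          # same value as sum(x^2)

# A is a hypothetical symmetric matrix, used only for illustration:
# 1.5 on the diagonal and 0.5 elsewhere
A <- diag(5) + 0.5

# General quadratic form x^T A x
quad <- drop(t(x) %*% A %*% x)
```

With this choice of A, the quadratic form equals the inner product plus half the squared sum of the elements of x.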
Types of Data
Numeric or nominal.
Continuous or discrete (or categorical).
Nominal, ordinal, linear, ratio.
Supervised Learning; Classification
Given a dataset with some input features (factors, independent variables, etc.) and some output features (responses, groups or classes, dependent variables, etc.), see if we can figure out a rule that uses the attributes of a given observation to tell what group the observation is likely to be in.
To do this, we may use part of the dataset as a training set and then see how good our rule is by applying it to the remainder, the test set.
You can see how we might do this by using various subsets of the dataset as training or test sets.
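The training/test idea can be sketched in R as follows. The dataset, the 30% holdout fraction, and the rule (a linear regression on the 0/1 label cut at 0.5) are all assumptions made for illustration.

```r
set.seed(1)                        # arbitrary seed, for reproducibility
n <- 200

# Hypothetical dataset: two inputs and a 0/1 group label
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$G <- as.numeric(dat$x1 + dat$x2 + rnorm(n) > 0)

# Hold out 30% of the rows as a test set
test_idx <- sample(n, size = round(0.3 * n))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Rule learned from the training set only
fit  <- lm(G ~ x1 + x2, data = train)
Ghat <- as.numeric(predict(fit, newdata = test) > 0.5)

# Misclassification rate on the held-out observations
test_error <- mean(Ghat != test$G)
```

Repeating the split with different random subsets gives a sense of how much the estimated error rate varies.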
Based on the observed values of $x_1$ and $x_2$, can we tell whether a point should be red or green?
[Figure: scatterplot of $x_2$ against $x_1$; legend: Group 0, Group 1.]
Go to R.
Linear Models
If we have input variables $X_1, \ldots, X_p$ and an output variable $Y$, a linear model that relates them is
$Y \approx \beta_0 + \sum_{j=1}^{p} X_j \beta_j$,
or in vector notation
$Y \approx X^T \beta$,
where we've put a constant 1 in the first position of the vector $X$.
In the context of machine learning, the constant $\beta_0$ is sometimes called the bias.
Fitting Linear Models Using Estimates
In statistical applications, we might assume such a model exists, and use data to estimate $\beta$. We'll let $\hat\beta$ be the estimate of $\beta$.
Given a set of input variables, we'll predict the output as
$\hat{Y} = X^T \hat\beta$.
We often change the notation a little. Let $y_i$ be the response in the $i$-th observation and let $x_i$ be the vector of inputs in the $i$-th observation:
$y_i \approx x_i^T \beta$.
Fitting Linear Models by Global Least Squares
Given $y_i \approx x_i^T \beta$, we often estimate $\beta$ by least squares: find $\hat\beta$ so that
$\sum_{i=1}^{n} (y_i - x_i^T \hat\beta)^2$
is minimized. This is an $L_2$ fit.
Of course we could also estimate $\beta$ by other criteria applied to the residuals, such as least absolute values, where
$\sum_{i=1}^{n} |y_i - x_i^T \hat\beta|$
is minimized. This is an $L_1$ fit.
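To see how the two criteria differ, consider the simplest case, a model with only a constant term. The sketch below uses made-up data with one gross outlier and minimizes each criterion numerically. In this special case the $L_2$ fit is the sample mean and the $L_1$ fit is the sample median.

```r
set.seed(2)
# Hypothetical sample: 20 standard normal values plus one gross outlier
y <- c(rnorm(20), 50)

# Constant-only model y_i ~ beta_0: the two criteria as functions of b
l2 <- function(b) sum((y - b)^2)
l1 <- function(b) sum(abs(y - b))

# One-dimensional numerical minimization over the range of the data
b_l2 <- optimize(l2, interval = range(y))$minimum   # L2 fit
b_l1 <- optimize(l1, interval = range(y))$minimum   # L1 fit

# Compare with the closed-form answers
c(L2 = b_l2, mean = mean(y), L1 = b_l1, median = median(y))
```

The outlier pulls the $L_2$ fit well away from the bulk of the data, while the $L_1$ fit barely moves.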
Alternate Notation for a Dataset in a Linear Model
$y$ is the $n$-vector of observations on the output variable, and $X$ is the $n \times (p+1)$ matrix of corresponding observations on the input variables, in which the $i$-th row corresponds to the vector of input variables (plus the constant 1):
$y \approx X\beta$.
Least squares criterion to minimize:
$(y - X\hat\beta)^T (y - X\hat\beta)$.
Yields
$X^T X \hat\beta = X^T y$.
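As a check on the matrix form, the sketch below solves the normal equations directly and compares the result with lm(). The simulated data and the coefficients (1, 2, -1) are assumptions for illustration only.

```r
set.seed(3)
n <- 50; p <- 2

# Hypothetical data; the true coefficients (1, 2, -1) are assumptions
Z <- matrix(rnorm(n * p), ncol = p)
y <- 1 + 2 * Z[, 1] - Z[, 2] + rnorm(n, sd = 0.5)

# n x (p+1) matrix with the constant 1 in the first column
X <- cbind(1, Z)

# Solve X^T X b = X^T y without forming an explicit inverse
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() works through a QR decomposition but gives the same estimate
fit <- lm(y ~ Z)
cbind(normal_eq = drop(beta_hat), lm = unname(coef(fit)))
```

Solving the normal equations directly is fine for a small, well-conditioned problem like this one; the QR approach is numerically safer when the columns of X are nearly collinear.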
Now, let's return to our original problem.
[Figure: scatterplot of $x_2$ against $x_1$; legend: Group 0, Group 1.]
This was generated in a manner similar to the way the data in Figure 2.1 on page 13 was generated. (See description on page 12.)

    set.seed(555)
    p <- 2
    nm <- 10
    # nm candidate means for each group, shifted along different axes
    m1 <- matrix(rnorm(p*nm), ncol=p)
    m1[,1] <- m1[,1] + 1
    m2 <- matrix(rnorm(p*nm), ncol=p)
    m2[,2] <- m2[,2] + 1
    n1 <- 100
    n2 <- 100
    # each observation is normal noise around a randomly chosen mean
    m1index <- sample(nm, n1, replace=TRUE)
    X1 <- matrix(rnorm(p*n1), ncol=p) + m1[m1index,]
    m2index <- sample(nm, n2, replace=TRUE)
    X2 <- matrix(rnorm(p*n2), ncol=p) + m2[m2index,]
    # Group 0 in red (col=2), Group 1 in green (col=3)
    plot(X1[,1], X1[,2], col=2,
         xlab=expression(italic(x)[1]), ylab=expression(italic(x)[2]))
    points(X2[,1], X2[,2], col=3)
    legend("topright", legend=c("Group 0","Group 1"), pch=c(1,1), col=c(2,3))
Would a linear model form a useful discriminator between the groups?
Let's first put the data together in a single dataset with a group variable G = 0 if in the first group and G = 1 if in the second group, and fit a linear regression with G as the dependent variable. (Often (0,1) is a more useful indicator pair than is (−1,1), because we can relate it easily to a Bernoulli probability.)
Fit $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ and take $\hat{G} = 1$ if $\hat{y} > 0.5$.
If $\hat\beta_2 \ne 0$, the intersection of this plane with the $y = 0.5$ plane is the line
$x_2 = (-\hat\beta_0 + 0.5)/\hat\beta_2 - (\hat\beta_1/\hat\beta_2)\, x_1$,
so we can draw it on our plot, which is a projection of the 3-space onto the $y = 0.5$ plane. (It is not the $x_1$-$x_2$ plane itself.)
    # combine the two groups into one data frame with a 0/1 group variable
    ex1 <- data.frame(x1=c(X1[,1],X2[,1]), x2=c(X1[,2],X2[,2]),
                      G=c(rep(0,n1),rep(1,n2)))
    attach(ex1)
    fit2 <- lm(G~x1+x2)
    plot(x1, x2, col=G+2,
         xlab=expression(italic(x)[1]), ylab=expression(italic(x)[2]))
    b0 <- fit2$coef[1]; b1 <- fit2$coef[2]; b2 <- fit2$coef[3]
    if (abs(b2) > .Machine$double.eps) {
      # the line where the fitted plane crosses y = 0.5
      abline((-b0+0.5)/b2, -b1/b2)
      # shade a grid of points by the side of the line they fall on
      npts <- 50
      v1 <- min(x1) + (1:npts)*((max(x1)-min(x1))/npts)
      v2 <- min(x2) + (1:npts)*((max(x2)-min(x2))/npts)
      for (i in 1:npts)
        points(v1, rep(v2[i],npts), pch=".",
               col=(v2[i] > (-b0+0.5)/b2 - (b1/b2)*v1) + 2)
    }
$\hat{G} = 1$ (green) above the line and $\hat{G} = 0$ below the line.
[Figure: scatterplot of $x_2$ against $x_1$ with the fitted line and the two classification regions.]
The linear classifier does not do a very good job. Just eyeballing the problem, we can see that no single line could separate the points.
Are there as many reds above the line as greens below it? This is equivalent to asking whether the number of positive residuals is almost the same as the number of negative residuals. In least squares fitting, this is not necessarily the case, especially in the (more usual) case in which the response is a continuous variable.
Even in the case of a dichotomous response variable, the number above may not be the same as the number below. In this example, however, sum(fit2$residuals>0) is 101 (essentially half), so it is not an issue in this case.
We can ensure that the number above is almost the same as the number below by doing an $L_1$ fit:

    library(quantreg)
    fit1 <- rq(G~x1+x2)
Unknown or Unused Features
Often an additional input variable could allow a much better classification. As scientists, that should be the first thing we think about.
The more variables we have, the more degrees of freedom for fitting we have.
In statistical data analysis, we speak of the degrees of freedom for fitting or the model degrees of freedom, and the leftover residual degrees of freedom or the degrees of freedom for error.
Unknown or Unused Features (continued)
At the end of the day, however, the data scientist must accept what is given, and extract the most useful knowledge from that.
So let's continue investigating the possibilities given only the inputs $x_1$ and $x_2$.
Nonlinear Global Classifiers
Could a polynomial model
$y \approx \beta_{00} + \beta_{10}x_1 + \cdots + \beta_{p0}x_1^p + \beta_{01}x_2 + \cdots + \beta_{0p}x_2^p + \beta_{11}x_1x_2 + \cdots + \beta_{pp}x_1^p x_2^p$
work better? Possibly. (A polynomial model is still a linear model.)
Could a generalized linear model
$E(Y) \approx e^{\beta^T x}$
work better? Possibly. (A logistic model, e.g.)
Could a general nonlinear model
$y \approx f(\beta, x)$
work better? Possibly. (What form for the function $f$?)
Greater model complexity can increase the degrees of freedom for fitting that we have.
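The remark that a polynomial model is still a linear model can be illustrated in R: the second-degree terms are just extra columns, and lm() fits them by ordinary least squares. The data below are hypothetical, with Group 1 placed inside the unit circle so that no straight line can separate the groups.

```r
set.seed(4)
n <- 400
x1 <- rnorm(n); x2 <- rnorm(n)

# Hypothetical groups: Group 1 lies inside the unit circle
G <- as.numeric(x1^2 + x2^2 < 1)

# First-degree and second-degree models, both fit by lm()
fit_lin  <- lm(G ~ x1 + x2)
fit_quad <- lm(G ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))

# Training misclassification rates with the 0.5 cutoff
err_lin  <- mean((fitted(fit_lin)  > 0.5) != G)
err_quad <- mean((fitted(fit_quad) > 0.5) != G)
c(linear = err_lin, quadratic = err_quad)
```

The quadratic fit uses three more degrees of freedom than the first-degree fit, which is what lets its decision boundary curve.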
Local Fitting: Use of Nearest Neighbors
Use only the nearest k neighbors to predict.
The first question is how many nearest neighbors to use. (See Figures 2.2 and 2.3 in the text.)
Note that the more nearest neighbors we use, the smoother the fit becomes. (The separating line becomes less smooth, however; note the jagged line in Figure 2.2.)
Variations on the use of nearest neighbors include kernel methods, local weighting, and also the use of local parametric models.
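A minimal base-R sketch of nearest-neighbor classification by majority vote. The helper knn_predict and its argument names are hypothetical, written here for illustration rather than taken from a package, and the training data are simulated.

```r
# Classify one point x0 by majority vote among its k nearest neighbors
# (Euclidean distance). knn_predict is a hypothetical helper, not a library call.
knn_predict <- function(x0, train_X, train_G, k) {
  d <- sqrt(rowSums(sweep(train_X, 2, x0)^2))  # distance from x0 to every row
  nbrs <- order(d)[1:k]                        # indices of the k closest rows
  as.numeric(mean(train_G[nbrs]) > 0.5)        # majority vote (use odd k to avoid ties)
}

set.seed(5)
n <- 100
# Hypothetical training data: two groups with shifted means
X <- rbind(matrix(rnorm(2 * n), ncol = 2),
           matrix(rnorm(2 * n, mean = 2), ncol = 2))
G <- rep(c(0, 1), each = n)

# Training error for several k; with k = 1 each point is its own nearest neighbor
ks <- c(1, 15, 51)
train_err <- sapply(ks, function(k)
  mean(apply(X, 1, knn_predict, train_X = X, train_G = G, k = k) != G))
names(train_err) <- paste0("k=", ks)
train_err
```

The zero training error at k = 1 is an artifact of evaluating on the training set itself, which is one reason to judge the choice of k on held-out data.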
Degrees of Freedom
In nonparametric or semiparametric procedures, we speak of the effective degrees of freedom for fitting.
The effective degrees of freedom for fitting may also depend on the sample size.
In the case of simple nearest neighbor fitting, the effective model degrees of freedom increase as the number of nearest neighbors decreases.
Fitting Based on Local Probabilities
Statistical methods are developed in the context of a family of probability models. The family can be strong (very specific models) or weak (a wide range of models).
If we have a model for the probabilities of each of the classes at each point in the space of inputs, there is a straightforward method of classification.
Statistical Decision Theory
Define a loss function. It is usually based on errors; the MSE is a common loss function. In classification problems, the 0-1 loss function is a logical choice.
If we have a probability model, we choose a method that minimizes the expected value of the loss (the risk).
In a classification problem, we also consider the prediction error and its expected value, EPE.
A Probability Model for Classification
Suppose we have $K$ groups, $\mathcal{G} = \{G_1, \ldots, G_K\}$. (These are sometimes called targets in the classification problem.)
Suppose we have a random variable $G$ whose support is $\mathcal{G}$.
Suppose we have an observable random variable $X$ with support $\mathcal{X}$.
Suppose for each $k \in \{1, \ldots, K\}$ and $x \in \mathcal{X}$, we have a model that gives $\Pr(G = G_k \mid X = x)$.
That last supposition is a lot!! But let's proceed.
A Probability Model for Classification and the Bayes Classifier
Given $x$ and the prior assumed probability distribution $\Pr(G = G_k \mid X = x)$, we want $\hat{G}(x)$ that is optimal (in some way).
Take a 0-1 loss:
$L(G, \hat{G}(x)) = 0$ if $G = \hat{G}(x)$, and $L(G, \hat{G}(x)) = 1$ if $G \ne \hat{G}(x)$.
(This is just a little strange for a frequentist statistician, because $G$ is a random variable. In a Bayesian context, however, it is the usual formulation.)
We choose $\hat{G}(x)$ to minimize the risk (the expected value of the loss). The optimal choice under the 0-1 loss is
$\hat{G}(x) = \arg\max_{G_k \in \mathcal{G}} \Pr(G = G_k \mid X = x)$.
This is called the Bayes classifier.
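When the probability model is fully known, the Bayes classifier can be written down directly. The sketch below assumes a made-up model with two groups, equal prior probabilities, and one-dimensional normal inputs with unit variance; posterior and bayes_classify are hypothetical helper names.

```r
# Assumed model (hypothetical): two groups with equal prior probabilities,
# and X | G = G_k distributed N(mu_k, 1) in one dimension
mu    <- c(0, 2)
prior <- c(0.5, 0.5)

# Pr(G = G_k | X = x) for k = 1, 2, by Bayes' rule
posterior <- function(x) {
  joint <- prior * dnorm(x, mean = mu, sd = 1)
  joint / sum(joint)
}

# Bayes classifier: index of the group with the largest posterior probability
bayes_classify <- function(x) which.max(posterior(x))

# Under this model the decision boundary is the midpoint x = 1
sapply(c(-1, 0.9, 1.1, 3), bayes_classify)

# Risk of the Bayes classifier under 0-1 loss for this model
bayes_error <- pnorm(-1)
```

No classifier can have a smaller expected 0-1 loss than bayes_error under this model, which makes it a useful benchmark in simulation studies.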
Higher Dimensions
For higher dimensions, we use projections.
Strange things can happen in higher dimensions. Everything becomes an outlier.
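One way to see why "everything becomes an outlier" is to simulate points uniformly in a cube and count how many fall inside the inscribed ball. The sample size and the dimensions tried are arbitrary choices for illustration.

```r
set.seed(6)
n <- 10000

# Fraction of points, uniform on the cube [-1, 1]^p, that fall inside
# the inscribed unit ball; the rest lie out in the "corners"
dims <- c(2, 5, 10)
frac_inside <- sapply(dims, function(p) {
  U <- matrix(runif(n * p, min = -1, max = 1), ncol = p)
  mean(rowSums(U^2) <= 1)
})
names(frac_inside) <- paste0("p=", dims)
frac_inside
```

In two dimensions most points are inside the ball (the exact fraction is pi/4), while in ten dimensions almost none are, so nearly every point is far from the center.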
Statistical Models
Various statistical models.
Various methods of fitting a model.
Restricted and Regularized Estimators
Use local weighting - kernel regression.
Add a penalty to the criterion; e.g., regularized least squares:
$(y - X\hat\beta)^T (y - X\hat\beta) + \lambda f(\hat\beta)$.
$\lambda$ is a tuning parameter.
PRESS
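As one concrete instance of regularized least squares, the sketch below takes the penalty to be the squared length of the coefficient vector (ridge regression). That choice of penalty, the simulated data, and the helper name ridge are assumptions; for simplicity the intercept is penalized too, which is often avoided in practice.

```r
set.seed(7)
n <- 60
Z <- matrix(rnorm(n * 3), ncol = 3)
# Hypothetical true coefficients: intercept 1 and slopes (2, 0, -1)
y <- drop(1 + Z %*% c(2, 0, -1)) + rnorm(n)
X <- cbind(1, Z)

# Minimizer of (y - Xb)^T (y - Xb) + lambda * b^T b for a given lambda:
# solve (X^T X + lambda I) b = X^T y
ridge <- function(lambda)
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))

b_ols   <- ridge(0)      # lambda = 0 recovers ordinary least squares
b_ridge <- ridge(10)     # a larger lambda shrinks the estimate toward 0

# Squared length of each estimate
c(ols = sum(b_ols^2), ridge = sum(b_ridge^2))
```

The tuning parameter trades fidelity to the data against the size of the coefficients, and is usually chosen by a criterion such as cross-validation.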
Model Selection
May introduce bias and/or variance.
Use of cross-validation.
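Cross-validation for model selection can be sketched as follows: K-fold cross-validation is used to choose the degree of a polynomial regression. The data, the choice K = 5, the range of degrees, and the helper cv_mse are all assumptions for illustration.

```r
set.seed(8)
n <- 100
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.3)     # hypothetical true curve plus noise
dat <- data.frame(x = x, y = y)

K <- 5
fold <- sample(rep(1:K, length.out = n)) # random assignment of rows to K folds

# Cross-validated mean squared prediction error for a polynomial of given degree
cv_mse <- function(degree) {
  sq_err <- numeric(n)
  for (k in 1:K) {
    held <- fold == k
    fit <- lm(y ~ poly(x, degree), data = dat[!held, ])   # fit without fold k
    sq_err[held] <- (dat$y[held] - predict(fit, newdata = dat[held, ]))^2
  }
  mean(sq_err)
}

degrees <- 1:8
cv <- sapply(degrees, cv_mse)
best_degree <- degrees[which.min(cv)]
```

Each observation is predicted by a model that never saw it, so cv estimates prediction error rather than training error.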
Variance/Bias Tradeoff
MSE = variance + bias-squared.
More smoothing yields more bias.
Less smoothing yields more variance.
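The decomposition can be checked numerically. The sketch below uses a deliberately biased estimator, the sample mean shrunk by a factor of 0.8; the estimator, the true mean, and the simulation sizes are assumptions chosen only to make both terms visible.

```r
set.seed(9)
mu <- 1; n <- 10; reps <- 20000

# Hypothetical estimator of mu: the sample mean shrunk by a factor of 0.8
est <- replicate(reps, 0.8 * mean(rnorm(n, mean = mu)))

# Monte Carlo estimates of the three quantities
mse      <- mean((est - mu)^2)
bias     <- mean(est) - mu
variance <- mean((est - mean(est))^2)

c(mse = mse, variance_plus_bias2 = variance + bias^2)
```

Shrinking harder would lower the variance term and raise the bias term, which is the tradeoff in miniature.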
Studying Statistical Methods by Simulation
Monte Carlo simulation methods are used to compare methods.
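A small Monte Carlo comparison of two location estimators, the sample mean and the sample median, shows the general pattern of such studies. The two sampling distributions (standard normal and t with 2 degrees of freedom), the sample size, and the helper sim_mse are assumptions for illustration.

```r
set.seed(10)
n <- 25; reps <- 5000

# Mean squared error of two location estimators over repeated samples.
# rdist is any function that draws n values from a distribution centered at 0.
sim_mse <- function(rdist) {
  est <- replicate(reps, {
    y <- rdist(n)
    c(mean = mean(y), median = median(y))
  })
  rowMeans(est^2)      # true center is 0, so squared estimate = squared error
}

mse_normal <- sim_mse(rnorm)                          # light tails
mse_t2     <- sim_mse(function(n) rt(n, df = 2))      # heavy tails
rbind(normal = mse_normal, t2 = mse_t2)
```

Which method wins depends on the model generating the data, which is why simulation studies usually cover several scenarios.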