Overview of Supervised Learning


2 Overview of Supervised Learning

2.1 Introduction

The first three examples described in Chapter 1 have several components in common. For each there is a set of variables that might be denoted as inputs, which are measured or preset. These have some influence on one or more outputs. For each example the goal is to use the inputs to predict the values of the outputs. This exercise is called supervised learning. We have used the more modern language of machine learning. In the statistical literature the inputs are often called the predictors, a term we will use interchangeably with inputs, and more classically the independent variables. In the pattern recognition literature the term features is preferred, which we use as well. The outputs are called the responses, or classically the dependent variables.

2.2 Variable Types and Terminology

The outputs vary in nature among the examples. In the glucose prediction example, the output is a quantitative measurement, where some measurements are bigger than others, and measurements close in value are close in nature. In the famous Iris discrimination example due to R. A. Fisher, the output is qualitative (species of Iris) and assumes values in a finite set G = {Virginica, Setosa, Versicolor}. In the handwritten digit example the output is one of 10 different digit classes: G = {0, 1, ..., 9}.

In both of these there is no explicit ordering in the classes, and in fact often descriptive labels rather than numbers are used to denote the classes. Qualitative variables are also referred to as categorical or discrete variables as well as factors.

For both types of outputs it makes sense to think of using the inputs to predict the output. Given some specific atmospheric measurements today and yesterday, we want to predict the ozone level tomorrow. Given the grayscale values for the pixels of the digitized image of the handwritten digit, we want to predict its class label. This distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs. We will see that these two tasks have a lot in common, and in particular both can be viewed as a task in function approximation.

Inputs also vary in measurement type; we can have some of each of qualitative and quantitative input variables. These have also led to distinctions in the types of methods that are used for prediction: some methods are defined most naturally for quantitative inputs, some most naturally for qualitative and some for both.

A third variable type is ordered categorical, such as small, medium and large, where there is an ordering between the values, but no metric notion is appropriate (the difference between medium and small need not be the same as that between large and medium). These are discussed further in Chapter 4.

Qualitative variables are typically represented numerically by codes. The easiest case is when there are only two classes or categories, such as success or failure, survived or died. These are often represented by a single binary digit or bit as 0 or 1, or else by -1 and 1. For reasons that will become apparent, such numeric codes are sometimes referred to as targets. When there are more than two categories, several alternatives are available. The most useful and commonly used coding is via dummy variables. Here a K-level qualitative variable is represented by a vector of K binary variables or bits, only one of which is "on" at a time. Although more compact coding schemes are possible, dummy variables are symmetric in the levels of the factor.

We will typically denote an input variable by the symbol X. If X is a vector, its components can be accessed by subscripts $X_j$. Quantitative outputs will be denoted by Y, and qualitative outputs by G (for group). We use uppercase letters such as X, Y or G when referring to the generic aspects of a variable. Observed values are written in lowercase; hence the ith observed value of X is written as $x_i$ (where $x_i$ is again a scalar or vector). Matrices are represented by bold uppercase letters; for example, a set of N input p-vectors $x_i$, $i = 1, \ldots, N$ would be represented by the $N \times p$ matrix $\mathbf{X}$. In general, vectors will not be bold, except when they have N components; this convention distinguishes a p-vector of inputs $x_i$ for the ith observation from the N-vector $\mathbf{x}_j$ consisting of all the observations on variable $X_j$.

Since all vectors are assumed to be column vectors, the ith row of $\mathbf{X}$ is $x_i^T$, the vector transpose of $x_i$.

For the moment we can loosely state the learning task as follows: given the value of an input vector X, make a good prediction of the output Y, denoted by $\hat{Y}$ (pronounced "y-hat"). If Y takes values in $\mathbb{R}$ then so should $\hat{Y}$; likewise for categorical outputs, $\hat{G}$ should take values in the same set $\mathcal{G}$ associated with G.

For a two-class G, one approach is to denote the binary coded target as Y, and then treat it as a quantitative output. The predictions $\hat{Y}$ will typically lie in [0, 1], and we can assign to $\hat{G}$ the class label according to whether $\hat{y} > 0.5$. This approach generalizes to K-level qualitative outputs as well.

We need data to construct prediction rules, often a lot of it. We thus suppose we have available a set of measurements $(x_i, y_i)$ or $(x_i, g_i)$, $i = 1, \ldots, N$, known as the training data, with which to construct our prediction rule.

2.3 Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

In this section we develop two simple but powerful prediction methods: the linear model fit by least squares and the k-nearest-neighbor prediction rule. The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions. The method of k-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

2.3.1 Linear Models and Least Squares

The linear model has been a mainstay of statistics for the past 30 years and remains one of our most important tools. Given a vector of inputs $X^T = (X_1, X_2, \ldots, X_p)$, we predict the output Y via the model

$$\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j. \tag{2.1}$$

The term $\hat\beta_0$ is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in X, include $\hat\beta_0$ in the vector of coefficients $\hat\beta$, and then write the linear model in vector form as an inner product

$$\hat{Y} = X^T \hat\beta, \tag{2.2}$$

where $X^T$ denotes vector or matrix transpose (X being a column vector). Here we are modeling a single output, so $\hat{Y}$ is a scalar; in general $\hat{Y}$ can be a K-vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients. In the (p + 1)-dimensional input-output space, $(X, \hat{Y})$ represents a hyperplane. If the constant is included in X, then the hyperplane includes the origin and is a subspace; if not, it is an affine set cutting the Y-axis at the point $(0, \hat\beta_0)$. From now on we assume that the intercept is included in $\hat\beta$.

Viewed as a function over the p-dimensional input space, $f(X) = X^T\beta$ is linear, and the gradient $f'(X) = \beta$ is a vector in input space that points in the steepest uphill direction.

How do we fit the linear model to a set of training data? There are many different methods, but by far the most popular is the method of least squares. In this approach, we pick the coefficients $\beta$ to minimize the residual sum of squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T\beta)^2. \tag{2.3}$$

$\mathrm{RSS}(\beta)$ is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. The solution is easiest to characterize in matrix notation. We can write

$$\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta), \tag{2.4}$$

where $\mathbf{X}$ is an $N \times p$ matrix with each row an input vector, and $\mathbf{y}$ is an N-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ we get the normal equations

$$\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0. \tag{2.5}$$

If $\mathbf{X}^T\mathbf{X}$ is nonsingular, then the unique solution is given by

$$\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}, \tag{2.6}$$

and the fitted value at the ith input $x_i$ is $\hat{y}_i = \hat{y}(x_i) = x_i^T\hat\beta$. At an arbitrary input $x_0$ the prediction is $\hat{y}(x_0) = x_0^T\hat\beta$. The entire fitted surface is characterized by the p parameters $\hat\beta$. Intuitively, it seems that we do not need a very large data set to fit such a model.

Let's look at an example of the linear model in a classification context. Figure 2.1 shows a scatterplot of training data on a pair of inputs $X_1$ and $X_2$. The data are simulated, and for the present the simulation model is not important. The output class variable G has the values BLUE or ORANGE, and is represented as such in the scatterplot. There are 100 points in each of the two classes. The linear regression model was fit to these data, with the response Y coded as 0 for BLUE and 1 for ORANGE. The fitted values $\hat{Y}$ are converted to a fitted class variable $\hat{G}$ according to the rule

$$\hat{G} = \begin{cases} \text{ORANGE} & \text{if } \hat{Y} > 0.5, \\ \text{BLUE} & \text{if } \hat{Y} \le 0.5. \end{cases} \tag{2.7}$$
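The calculations in (2.3)-(2.7) are easy to carry out directly. The following sketch, using NumPy on hypothetical simulated data (the variable names and the two-class simulation here are illustrative choices, not the book's actual data), fits the 0/1-coded response by least squares via the normal equations and thresholds the fitted values at 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-class data in two dimensions (for illustration only).
n = 100
x_blue = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(n, 2))
x_orange = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(n, 2))
X = np.vstack([x_blue, x_orange])
y = np.concatenate([np.zeros(n), np.ones(n)])  # BLUE = 0, ORANGE = 1

# Augment with a constant column so the intercept is part of beta-hat, as in (2.1)-(2.2).
Xa = np.column_stack([np.ones(len(X)), X])

# Least squares via the normal equations (2.5)-(2.6).
beta_hat = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

# Fitted values and the classification rule (2.7).
y_hat = Xa @ beta_hat
g_hat = np.where(y_hat > 0.5, "ORANGE", "BLUE")

train_error = np.mean((y_hat > 0.5) != (y == 1))
print(beta_hat, train_error)
```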

[Figure 2.1 about here. Panel title: "Linear Regression of 0/1 Response".]

FIGURE 2.1. A classification example in two dimensions. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by $x^T\hat\beta = 0.5$. The orange shaded region denotes that part of input space classified as ORANGE, while the blue region is classified as BLUE.

The set of points in $\mathbb{R}^2$ classified as ORANGE corresponds to $\{x : x^T\hat\beta > 0.5\}$, indicated in Figure 2.1, and the two predicted classes are separated by the decision boundary $\{x : x^T\hat\beta = 0.5\}$, which is linear in this case. We see that for these data there are several misclassifications on both sides of the decision boundary. Perhaps our linear model is too rigid, or are such errors unavoidable? Remember that these are errors on the training data itself, and we have not said where the constructed data came from. Consider the two possible scenarios:

Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

A mixture of Gaussians is best described in terms of the generative model. One first generates a discrete variable that determines which of the component Gaussians to use, and then generates an observation from the chosen density.

In the case of one Gaussian per class, we will see in Chapter 4 that a linear decision boundary is the best one can do, and that our estimate is almost optimal. The region of overlap is inevitable, and future data to be predicted will be plagued by this overlap as well.

In the case of mixtures of tightly clustered Gaussians the story is different. A linear decision boundary is unlikely to be optimal, and in fact is not. The optimal decision boundary is nonlinear and disjoint, and as such will be much more difficult to obtain.

We now look at another classification and regression procedure that is in some sense at the opposite end of the spectrum to the linear model, and far better suited to the second scenario.

2.3.2 Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set $\mathcal{T}$ closest in input space to x to form $\hat{Y}$. Specifically, the k-nearest neighbor fit for $\hat{Y}$ is defined as follows:

$$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i, \tag{2.8}$$

where $N_k(x)$ is the neighborhood of x defined by the k closest points $x_i$ in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance. So, in words, we find the k observations with $x_i$ closest to x in input space, and average their responses.

In Figure 2.2 we use the same training data as in Figure 2.1, and use 15-nearest-neighbor averaging of the binary coded response as the method of fitting. Thus $\hat{Y}$ is the proportion of ORANGE's in the neighborhood, and so assigning class ORANGE to $\hat{G}$ if $\hat{Y} > 0.5$ amounts to a majority vote in the neighborhood. The colored regions indicate all those points in input space classified as BLUE or ORANGE by such a rule, in this case found by evaluating the procedure on a fine grid in input space. We see that the decision boundaries that separate the BLUE from the ORANGE regions are far more irregular, and respond to local clusters where one class dominates.

Figure 2.3 shows the results for 1-nearest-neighbor classification: $\hat{Y}$ is assigned the value $y_\ell$ of the closest point $x_\ell$ to x in the training data. In this case the regions of classification can be computed relatively easily, and correspond to a Voronoi tessellation of the training data. Each point $x_i$ has an associated tile bounding the region for which it is the closest input point. For all points x in the tile, $\hat{G}(x) = g_i$. The decision boundary is even more irregular than before.

The method of k-nearest-neighbor averaging is defined in exactly the same way for regression of a quantitative output Y, although k = 1 would be an unlikely choice.
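A direct implementation of (2.8) needs only pairwise distances. The sketch below is an illustration, not the book's code; it reuses the hypothetical X and y from the earlier least-squares snippet, computes the 15-nearest-neighbor average of the 0/1 response at a query point, and classifies by majority vote:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=15):
    """k-nearest-neighbor average of the responses, as in (2.8)."""
    # Euclidean distances from the query point to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Majority vote: classify ORANGE when the neighborhood proportion exceeds 0.5.
x0 = np.array([0.5, 0.5])            # arbitrary query point
y_hat0 = knn_predict(X, y, x0, k=15) # X, y as in the earlier sketch
g_hat0 = "ORANGE" if y_hat0 > 0.5 else "BLUE"
print(y_hat0, g_hat0)
```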

[Figure 2.2 about here. Panel title: "15-Nearest Neighbor Classifier".]

FIGURE 2.2. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1) and then fit by 15-nearest-neighbor averaging as in (2.8). The predicted class is hence chosen by majority vote amongst the 15 nearest neighbors.

In Figure 2.2 we see that far fewer training observations are misclassified than in Figure 2.1. This should not give us too much comfort, though, since in Figure 2.3 none of the training data are misclassified. A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be approximately an increasing function of k, and will always be 0 for k = 1. An independent test set would give us a more satisfactory means for comparing the different methods.

It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.

It is also clear that we cannot use sum-of-squared errors on the training set as a criterion for picking k, since we would always pick k = 1! It would seem that k-nearest-neighbor methods would be more appropriate for the mixture Scenario 2 described above, while for Gaussian data the decision boundaries of k-nearest neighbors would be unnecessarily noisy.

[Figure 2.3 about here. Panel title: "1-Nearest Neighbor Classifier".]

FIGURE 2.3. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then predicted by 1-nearest-neighbor classification.

2.3.3 From Least Squares to Nearest Neighbors

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. In language we will develop shortly, it has low variance and potentially high bias.

On the other hand, the k-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable: high variance and low bias.

Each method has its own situations for which it works best; in particular linear regression is more appropriate for Scenario 1 above, while nearest neighbors are more suitable for Scenario 2. The time has come to expose the oracle! The data in fact were simulated from a model somewhere between the two, but closer to Scenario 2. First we generated 10 means $m_k$ from a bivariate Gaussian distribution $N((1, 0)^T, \mathbf{I})$ and labeled this class BLUE. Similarly, 10 more were drawn from $N((0, 1)^T, \mathbf{I})$ and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an $m_k$ at random with probability 1/10, and then generated a $N(m_k, \mathbf{I}/5)$, thus leading to a mixture of Gaussian clusters for each class.
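The generative recipe just described is simple to simulate. The sketch below is one plausible reading of it (seeds, function names and structure are my own, not the authors'): draw 10 component means per class, then for each observation pick a mean uniformly and draw from a Gaussian with covariance I/5 around it.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_mixture_class(center, n_means=10, n_obs=100):
    """Draw n_means component means from N(center, I), then n_obs points,
    each from N(m_k, I/5) with the mean m_k chosen uniformly at random."""
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    picks = rng.integers(0, n_means, size=n_obs)      # which cluster, probability 1/10 each
    return means[picks] + rng.multivariate_normal(
        np.zeros(2), np.eye(2) / 5.0, size=n_obs)

X_blue = simulate_mixture_class(np.array([1.0, 0.0]))    # class BLUE
X_orange = simulate_mixture_class(np.array([0.0, 1.0]))  # class ORANGE
X_train = np.vstack([X_blue, X_orange])
y_train = np.concatenate([np.zeros(100), np.ones(100)])
```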

[Figure 2.4 about here. Training and test misclassification error plotted against the number of nearest neighbors k (equivalently, degrees of freedom N/k), with the linear regression and Bayes error rates marked.]

FIGURE 2.4. Misclassification curves for the simulation example used in Figures 2.1, 2.2 and 2.3. A single training sample of size 200 was used, and a test sample of size 10,000. The orange curves are test and the blue are training error for k-nearest-neighbor classification. The results for linear regression are the bigger orange and blue squares at three degrees of freedom. The purple line is the optimal Bayes error rate.

Figure 2.4 shows the results of classifying 10,000 new observations generated from the model. We compare the results for least squares and those for k-nearest neighbors for a range of values of k.

A large subset of the most popular techniques in use today are variants of these two simple procedures. In fact 1-nearest-neighbor, the simplest of all, captures a large percentage of the market for low-dimensional problems. The following list describes some ways in which these simple procedures have been enhanced:

• Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors. In high-dimensional spaces the distance kernels are modified to emphasize some variable more than others.

• Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.

• Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.

• Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.

2.4 Statistical Decision Theory

In this section we develop a small amount of theory that provides a framework for developing models such as those discussed informally so far. We first consider the case of a quantitative output, and place ourselves in the world of random variables and probability spaces. Let $X \in \mathbb{R}^p$ denote a real valued random input vector, and $Y \in \mathbb{R}$ a real valued random output variable, with joint distribution Pr(X, Y). We seek a function f(X) for predicting Y given values of the input X. This theory requires a loss function L(Y, f(X)) for penalizing errors in prediction, and by far the most common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$. This leads us to a criterion for choosing f,

$$\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2 \tag{2.9}$$
$$= \int [y - f(x)]^2 \Pr(dx, dy), \tag{2.10}$$

the expected (squared) prediction error. By conditioning¹ on X, we can write EPE as

$$\mathrm{EPE}(f) = \mathrm{E}_X \mathrm{E}_{Y|X}\bigl([Y - f(X)]^2 \mid X\bigr) \tag{2.11}$$

and we see that it suffices to minimize EPE pointwise:

$$f(x) = \operatorname*{argmin}_{c} \mathrm{E}_{Y|X}\bigl([Y - c]^2 \mid X = x\bigr). \tag{2.12}$$

The solution is

$$f(x) = \mathrm{E}(Y \mid X = x), \tag{2.13}$$

the conditional expectation, also known as the regression function. The minimizer follows since $\mathrm{E}([Y - c]^2 \mid X = x) = \mathrm{Var}(Y \mid X = x) + [\mathrm{E}(Y \mid X = x) - c]^2$, which is smallest at $c = \mathrm{E}(Y \mid X = x)$. Thus the best prediction of Y at any point X = x is the conditional mean, when best is measured by average squared error.

The nearest-neighbor methods attempt to directly implement this recipe using the training data. At each point x, we might ask for the average of all those $y_i$'s with input $x_i = x$.

¹ Conditioning here amounts to factoring the joint density Pr(X, Y) = Pr(Y|X)Pr(X), where Pr(Y|X) = Pr(Y, X)/Pr(X), and splitting up the bivariate integral accordingly.

Since there is typically at most one observation at any point x, we settle for

$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)), \tag{2.14}$$

where "Ave" denotes average, and $N_k(x)$ is the neighborhood containing the k points in $\mathcal{T}$ closest to x. Two approximations are happening here:

• expectation is approximated by averaging over sample data;

• conditioning at a point is relaxed to conditioning on some region "close" to the target point.

For large training sample size N, the points in the neighborhood are likely to be close to x, and as k gets large the average will get more stable. In fact, under mild regularity conditions on the joint probability distribution Pr(X, Y), one can show that as $N, k \to \infty$ such that $k/N \to 0$, $\hat{f}(x) \to \mathrm{E}(Y \mid X = x)$. In light of this, why look further, since it seems we have a universal approximator? We often do not have very large samples. If the linear or some more structured model is appropriate, then we can usually get a more stable estimate than k-nearest neighbors, although such knowledge has to be learned from the data as well. There are other problems though, sometimes disastrous. In Section 2.5 we see that as the dimension p gets large, so does the metric size of the k-nearest neighborhood. So settling for nearest neighborhood as a surrogate for conditioning will fail us miserably. The convergence above still holds, but the rate of convergence decreases as the dimension increases.

How does linear regression fit into this framework? The simplest explanation is that one assumes that the regression function f(x) is approximately linear in its arguments:

$$f(x) \approx x^T\beta. \tag{2.15}$$

This is a model-based approach: we specify a model for the regression function. Plugging this linear model for f(x) into EPE (2.9) and differentiating, we can solve for $\beta$ theoretically:

$$\beta = [\mathrm{E}(XX^T)]^{-1}\mathrm{E}(XY). \tag{2.16}$$

Note we have not conditioned on X; rather we have used our knowledge of the functional relationship to pool over values of X. The least squares solution (2.6) amounts to replacing the expectation in (2.16) by averages over the training data.

So both k-nearest neighbors and least squares end up approximating conditional expectations by averages. But they differ dramatically in terms of model assumptions:

• Least squares assumes f(x) is well approximated by a globally linear function.

• k-nearest neighbors assumes f(x) is well approximated by a locally constant function.

Although the latter seems more palatable, we have already seen that we may pay a price for this flexibility.

Many of the more modern techniques described in this book are model based, although far more flexible than the rigid linear model. For example, additive models assume that

$$f(X) = \sum_{j=1}^{p} f_j(X_j). \tag{2.17}$$

This retains the additivity of the linear model, but each coordinate function $f_j$ is arbitrary. It turns out that the optimal estimate for the additive model uses techniques such as k-nearest neighbors to approximate univariate conditional expectations simultaneously for each of the coordinate functions. Thus the problems of estimating a conditional expectation in high dimensions are swept away in this case by imposing some (often unrealistic) model assumptions, in this case additivity.

Are we happy with the criterion (2.11)? What happens if we replace the $L_2$ loss function with the $L_1$: $\mathrm{E}|Y - f(X)|$? The solution in this case is the conditional median,

$$\hat{f}(x) = \mathrm{median}(Y \mid X = x), \tag{2.18}$$

which is a different measure of location, and its estimates are more robust than those for the conditional mean. $L_1$ criteria have discontinuities in their derivatives, which have hindered their widespread use. Other more resistant loss functions will be mentioned in later chapters, but squared error is analytically convenient and the most popular.

What do we do when the output is a categorical variable G? The same paradigm works here, except we need a different loss function for penalizing prediction errors. An estimate $\hat{G}$ will assume values in $\mathcal{G}$, the set of possible classes. Our loss function can be represented by a $K \times K$ matrix $\mathbf{L}$, where $K = \mathrm{card}(\mathcal{G})$. $\mathbf{L}$ will be zero on the diagonal and nonnegative elsewhere, where $L(k, \ell)$ is the price paid for classifying an observation belonging to class $\mathcal{G}_k$ as $\mathcal{G}_\ell$. Most often we use the zero-one loss function, where all misclassifications are charged a single unit. The expected prediction error is

$$\mathrm{EPE} = \mathrm{E}[L(G, \hat{G}(X))], \tag{2.19}$$

where again the expectation is taken with respect to the joint distribution Pr(G, X). Again we condition, and can write EPE as

$$\mathrm{EPE} = \mathrm{E}_X \sum_{k=1}^{K} L[\mathcal{G}_k, \hat{G}(X)]\Pr(\mathcal{G}_k \mid X) \tag{2.20}$$

and again it suffices to minimize EPE pointwise:

$$\hat{G}(x) = \operatorname*{argmin}_{g \in \mathcal{G}} \sum_{k=1}^{K} L(\mathcal{G}_k, g)\Pr(\mathcal{G}_k \mid X = x). \tag{2.21}$$

With the 0-1 loss function this simplifies to

$$\hat{G}(x) = \operatorname*{argmin}_{g \in \mathcal{G}} [1 - \Pr(g \mid X = x)] \tag{2.22}$$

or simply

$$\hat{G}(X) = \mathcal{G}_k \ \text{ if } \ \Pr(\mathcal{G}_k \mid X = x) = \max_{g \in \mathcal{G}} \Pr(g \mid X = x). \tag{2.23}$$

This reasonable solution is known as the Bayes classifier, and says that we classify to the most probable class, using the conditional (discrete) distribution Pr(G|X). Figure 2.5 shows the Bayes-optimal decision boundary for our simulation example. The error rate of the Bayes classifier is called the Bayes rate.

[Figure 2.5 about here. Panel title: "Bayes Optimal Classifier".]

FIGURE 2.5. The optimal Bayes decision boundary for the simulation example of Figures 2.1, 2.2 and 2.3. Since the generating density is known for each class, this boundary can be calculated exactly (Exercise 2.2).
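Because the simulation's class-conditional densities are known Gaussian mixtures, the Bayes rule (2.23) can be evaluated exactly once the 10 component means per class are fixed. A minimal sketch, assuming equal class priors and the generative model of Section 2.3.3 (the mean arrays are placeholders for the actual drawn means, e.g. those produced inside the earlier simulation sketch):

```python
import numpy as np

def mixture_density(x, means, cov_scale=0.2):
    """Density at x of an equally weighted mixture of N(m_k, cov_scale * I) components."""
    d = x.shape[-1]
    norm = (2 * np.pi * cov_scale) ** (-d / 2)
    sq = np.sum((x[None, :] - means) ** 2, axis=1)   # squared distance to each mean
    return np.mean(norm * np.exp(-sq / (2 * cov_scale)))

def bayes_classify(x, means_blue, means_orange):
    """Bayes rule (2.23) with equal priors: pick the class with the larger density."""
    p_blue = mixture_density(x, means_blue)
    p_orange = mixture_density(x, means_orange)
    return "ORANGE" if p_orange > p_blue else "BLUE"

# Example call, with hypothetical 10 x 2 arrays of class means:
# g = bayes_classify(np.array([0.3, 0.8]), means_blue, means_orange)
```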

Again we see that the k-nearest neighbor classifier directly approximates this solution: a majority vote in a nearest neighborhood amounts to exactly this, except that conditional probability at a point is relaxed to conditional probability within a neighborhood of a point, and probabilities are estimated by training-sample proportions.

Suppose for a two-class problem we had taken the dummy-variable approach and coded G via a binary Y, followed by squared error loss estimation. Then $\hat{f}(X) = \mathrm{E}(Y \mid X) = \Pr(G = \mathcal{G}_1 \mid X)$ if $\mathcal{G}_1$ corresponded to Y = 1. Likewise for a K-class problem, $\mathrm{E}(Y_k \mid X) = \Pr(G = \mathcal{G}_k \mid X)$. This shows that our dummy-variable regression procedure, followed by classification to the largest fitted value, is another way of representing the Bayes classifier. Although this theory is exact, in practice problems can occur, depending on the regression model used. For example, when linear regression is used, $\hat{f}(X)$ need not be positive, and we might be suspicious about using it as an estimate of a probability. We will discuss a variety of approaches to modeling Pr(G|X) in Chapter 4.

2.5 Local Methods in High Dimensions

We have examined two learning techniques for prediction so far: the stable but biased linear model and the less stable but apparently less biased class of k-nearest-neighbor estimates. It would seem that with a reasonably large set of training data, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging, since we should be able to find a fairly large neighborhood of observations close to any x and average them. This approach and our intuition breaks down in high dimensions, and the phenomenon is commonly referred to as the curse of dimensionality (Bellman, 1961). There are many manifestations of this problem, and we will examine a few here.

Consider the nearest-neighbor procedure for inputs uniformly distributed in a p-dimensional unit hypercube, as in Figure 2.6. Suppose we send out a hypercubical neighborhood about a target point to capture a fraction r of the observations. Since this corresponds to a fraction r of the unit volume, the expected edge length will be $e_p(r) = r^{1/p}$. In ten dimensions $e_{10}(0.01) = 0.63$ and $e_{10}(0.1) = 0.80$, while the entire range for each input is only 1.0. So to capture 1% or 10% of the data to form a local average, we must cover 63% or 80% of the range of each input variable. Such neighborhoods are no longer "local." Reducing r dramatically does not help much either, since the fewer observations we average, the higher is the variance of our fit.
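The edge-length formula is easy to check numerically. A small sketch (my own illustration) reproduces the figures quoted above for p = 10:

```python
def edge_length(r, p):
    """Expected edge length of a subcube capturing a fraction r of uniform data in [0,1]^p."""
    return r ** (1.0 / p)

for r in (0.01, 0.1):
    print(f"e_10({r}) = {edge_length(r, 10):.2f}")   # prints 0.63 and 0.80
```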

[Figure 2.6 about here. Left panels: "Unit Cube" and "Neighborhood"; right panel: subcube edge length ("Distance") against "Fraction of Volume" for d = 1, 2, 3, 10.]

FIGURE 2.6. The curse of dimensionality is well illustrated by a subcubical neighborhood for uniform data in a unit cube. The figure on the right shows the side-length of the subcube needed to capture a fraction r of the volume of the data, for different dimensions p. In ten dimensions we need to cover 80% of the range of each coordinate to capture 10% of the data.

Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbor estimate at the origin. The median distance from the origin to the closest data point is given by the expression

$$d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p} \tag{2.24}$$

(Exercise 2.3). A more complicated expression exists for the mean distance to the closest point. For N = 500, p = 10, $d(p, N) \approx 0.52$, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them.

Another manifestation of the curse is that the sampling density is proportional to $N^{1/p}$, where p is the dimension of the input space and N is the sample size. Thus, if $N_1 = 100$ represents a dense sample for a single input problem, then $N_{10} = 100^{10}$ is the sample size required for the same sampling density with 10 inputs. Thus in high dimensions all feasible training samples sparsely populate the input space.
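Expression (2.24) above, and the claim that d(10, 500) is about 0.52, are easy to verify both directly and by simulation. A quick sketch (my own check, not from the book), using the standard recipe of Gaussian directions and radii proportional to $U^{1/p}$ to sample uniformly in the unit ball:

```python
import numpy as np

def median_nearest_distance(p, N):
    """Median distance from the origin to the closest of N uniform points in the unit ball, eq. (2.24)."""
    return (1 - 0.5 ** (1.0 / N)) ** (1.0 / p)

print(median_nearest_distance(10, 500))          # about 0.52

# Monte Carlo check: uniform points in the unit ball.
rng = np.random.default_rng(2)
p, N, reps = 10, 500, 500
g = rng.normal(size=(reps, N, p))
directions = g / np.linalg.norm(g, axis=2, keepdims=True)
radii = rng.uniform(size=(reps, N, 1)) ** (1.0 / p)
points = directions * radii
print(np.median(np.min(np.linalg.norm(points, axis=2), axis=1)))   # also near 0.52
```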

Let us construct another uniform example. Suppose we have 1000 training examples $x_i$ generated uniformly on $[-1, 1]^p$. Assume that the true relationship between X and Y is $Y = f(X) = e^{-8\|X\|^2}$, without any measurement error. We use the 1-nearest-neighbor rule to predict $y_0$ at the test point $x_0 = 0$. Denote the training set by $\mathcal{T}$. We can compute the expected prediction error at $x_0$ for our procedure, averaging over all such samples of size 1000. Since the problem is deterministic, this is the mean squared error (MSE) for estimating f(0):

$$\begin{aligned}
\mathrm{MSE}(x_0) &= \mathrm{E}_{\mathcal{T}}[f(x_0) - \hat{y}_0]^2 \\
&= \mathrm{E}_{\mathcal{T}}[\hat{y}_0 - \mathrm{E}_{\mathcal{T}}(\hat{y}_0)]^2 + [\mathrm{E}_{\mathcal{T}}(\hat{y}_0) - f(x_0)]^2 \\
&= \mathrm{Var}_{\mathcal{T}}(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0).
\end{aligned} \tag{2.25}$$

Figure 2.7 illustrates the setup. We have broken down the MSE into two components that will become familiar as we proceed: variance and squared bias. Such a decomposition is always possible and often useful, and is known as the bias-variance decomposition.

Unless the nearest neighbor is at 0, $\hat{y}_0$ will be smaller than f(0) in this example, and so the average estimate will be biased downward. The variance is due to the sampling variance of the 1-nearest neighbor. In low dimensions and with N = 1000, the nearest neighbor is very close to 0, and so both the bias and variance are small. As the dimension increases, the nearest neighbor tends to stray further from the target point, and both bias and variance are incurred. By p = 10, for more than 99% of the samples the nearest neighbor is a distance greater than 0.5 from the origin. Thus as p increases, the estimate tends to be 0 more often than not, and hence the MSE levels off at 1.0, as does the bias, and the variance starts dropping (an artifact of this example).

Although this is a highly contrived example, similar phenomena occur more generally. The complexity of functions of many variables can grow exponentially with the dimension, and if we wish to be able to estimate such functions with the same accuracy as functions in low dimensions, then we need the size of our training set to grow exponentially as well. In this example, the function is a complex interaction of all p variables involved. The dependence of the bias term on distance depends on the truth, and it need not always dominate with 1-nearest neighbor. For example, if the function always involves only a few dimensions as in Figure 2.8, then the variance can dominate instead.
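The bias-variance behavior in (2.25) can be estimated by brute-force simulation. The sketch below is a rough illustration under the setup above (the repetition count is chosen for speed and will not exactly reproduce the book's figure); it estimates the squared bias and variance of the 1-nearest-neighbor estimate at $x_0 = 0$ for several dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    # Target function f(X) = exp(-8 * ||X||^2); note f(0) = 1.
    return np.exp(-8.0 * np.sum(x**2, axis=-1))

def one_nn_at_origin(p, n_train=1000, n_reps=200):
    """Monte Carlo estimate of bias^2 and variance of the 1-NN estimate of f(0)."""
    estimates = np.empty(n_reps)
    for r in range(n_reps):
        X = rng.uniform(-1.0, 1.0, size=(n_train, p))
        nearest = np.argmin(np.sum(X**2, axis=1))   # training point closest to 0
        estimates[r] = f(X[nearest])
    bias_sq = (estimates.mean() - 1.0) ** 2
    var = estimates.var()
    return bias_sq, var

for p in (1, 2, 5, 10):
    b2, v = one_nn_at_origin(p)
    print(p, round(b2, 3), round(v, 3), round(b2 + v, 3))   # MSE = bias^2 + variance
```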

[Figure 2.7 about here. Panels: "1-NN in One Dimension", "1-NN in One vs. Two Dimensions", "Distance to 1-NN vs. Dimension" (average distance to nearest neighbor), and "MSE vs. Dimension" (MSE, variance, squared bias).]

FIGURE 2.7. A simulation example, demonstrating the curse of dimensionality and its effect on MSE, bias and variance. The input features are uniformly distributed in $[-1, 1]^p$ for $p = 1, \ldots, 10$. The top left panel shows the target function (no noise) in $\mathbb{R}$: $f(X) = e^{-8\|X\|^2}$, and demonstrates the error that 1-nearest neighbor makes in estimating f(0). The training point is indicated by the blue tick mark. The top right panel illustrates why the radius of the 1-nearest neighborhood increases with dimension p. The lower left panel shows the average radius of the 1-nearest neighborhoods. The lower right panel shows the MSE, squared bias and variance curves as a function of dimension p.

[Figure 2.8 about here. Panels: "1-NN in One Dimension" and "MSE vs. Dimension" (MSE, variance, squared bias).]

FIGURE 2.8. A simulation example with the same setup as in Figure 2.7. Here the function is constant in all but one dimension: $F(X) = \tfrac{1}{2}(X_1 + 1)^3$. The variance dominates.

Suppose, on the other hand, that we know that the relationship between Y and X is linear,

$$Y = X^T\beta + \varepsilon, \tag{2.26}$$

where $\varepsilon \sim N(0, \sigma^2)$, and we fit the model by least squares to the training data. For an arbitrary test point $x_0$, we have $\hat{y}_0 = x_0^T\hat\beta$, which can be written as $\hat{y}_0 = x_0^T\beta + \sum_{i=1}^{N} \ell_i(x_0)\varepsilon_i$, where $\ell_i(x_0)$ is the ith element of $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0$. Since under this model the least squares estimates are unbiased, we find that

$$\begin{aligned}
\mathrm{EPE}(x_0) &= \mathrm{E}_{y_0|x_0}\mathrm{E}_{\mathcal{T}}(y_0 - \hat{y}_0)^2 \\
&= \mathrm{Var}(y_0 \mid x_0) + \mathrm{E}_{\mathcal{T}}[\hat{y}_0 - \mathrm{E}_{\mathcal{T}}\hat{y}_0]^2 + [\mathrm{E}_{\mathcal{T}}\hat{y}_0 - x_0^T\beta]^2 \\
&= \mathrm{Var}(y_0 \mid x_0) + \mathrm{Var}_{\mathcal{T}}(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0) \\
&= \sigma^2 + \mathrm{E}_{\mathcal{T}}\, x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\,\sigma^2 + 0^2.
\end{aligned} \tag{2.27}$$

Here we have incurred an additional variance $\sigma^2$ in the prediction error, since our target is not deterministic. There is no bias, and the variance depends on $x_0$. If N is large and $\mathcal{T}$ were selected at random, and assuming E(X) = 0, then $\mathbf{X}^T\mathbf{X} \to N\,\mathrm{Cov}(X)$ and

$$\begin{aligned}
\mathrm{E}_{x_0}\mathrm{EPE}(x_0) &\approx \mathrm{E}_{x_0}\, x_0^T\mathrm{Cov}(X)^{-1}x_0\,\sigma^2/N + \sigma^2 \\
&= \mathrm{trace}[\mathrm{Cov}(X)^{-1}\mathrm{Cov}(x_0)]\,\sigma^2/N + \sigma^2 \\
&= \sigma^2(p/N) + \sigma^2.
\end{aligned} \tag{2.28}$$

Here we see that the expected EPE increases linearly as a function of p, with slope $\sigma^2/N$. If N is large and/or $\sigma^2$ is small, this growth in variance is negligible (0 in the deterministic case). By imposing some heavy restrictions on the class of models being fitted, we have avoided the curse of dimensionality. Some of the technical details in (2.27) and (2.28) are derived in Exercise 2.5.
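The $\sigma^2(p/N) + \sigma^2$ growth is easy to see numerically. The sketch below is my own illustration under the assumptions of (2.26)-(2.28) (E(X) = 0, known $\sigma$, N much larger than p; the particular N and $\sigma$ are arbitrary): it averages the conditional expected prediction error from (2.27), whose bias term is zero, over random training inputs and test points, and compares it with the formula.

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma, n_reps = 500, 1.0, 200

for p in (2, 5, 10, 20):
    vals = []
    for _ in range(n_reps):
        X = rng.normal(size=(N, p))        # E(X) = 0, Cov(X) = I
        x0 = rng.normal(size=p)            # random test point
        # Conditional expected prediction error at x0 given the training inputs:
        # sigma^2 (irreducible) + x0^T (X^T X)^{-1} x0 * sigma^2 (estimation variance).
        vals.append(sigma**2 * (1.0 + x0 @ np.linalg.solve(X.T @ X, x0)))
    print(p, round(np.mean(vals), 4), round(sigma**2 * (1 + p / N), 4))
```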

[Figure 2.9 about here. Panel title: "Expected Prediction Error of 1NN vs. OLS"; EPE ratio against dimension for the linear and cubic cases.]

FIGURE 2.9. The curves show the expected prediction error (at $x_0 = 0$) for 1-nearest neighbor relative to least squares for the model $Y = f(X) + \varepsilon$. For the orange curve, $f(x) = x_1$, while for the blue curve $f(x) = \tfrac{1}{2}(x_1 + 1)^3$.

Figure 2.9 compares 1-nearest neighbor vs. least squares in two situations, both of which have the form $Y = f(X) + \varepsilon$, with X uniform as before and $\varepsilon \sim N(0, 1)$. The sample size is N = 500. For the orange curve, f(x) is linear in the first coordinate; for the blue curve, it is cubic as in Figure 2.8. Shown is the relative EPE of 1-nearest neighbor to least squares, which appears to start at around 2 for the linear case. Least squares is unbiased in this case, and as discussed above the EPE is slightly above $\sigma^2 = 1$. The EPE for 1-nearest neighbor is always above 2, since the variance of $\hat{f}(x_0)$ in this case is at least $\sigma^2$, and the ratio increases with dimension as the nearest neighbor strays from the target point. For the cubic case, least squares is biased, which moderates the ratio. Clearly we could manufacture examples where the bias of least squares would dominate the variance, and the 1-nearest neighbor would come out the winner.

By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is substantially larger. However, if the assumptions are wrong, all bets are off and the 1-nearest neighbor may dominate. We will see that there is a whole spectrum of models between the rigid linear models and the extremely flexible 1-nearest-neighbor models, each with their own assumptions and biases, which have been proposed specifically to avoid the exponential growth in complexity of functions in high dimensions by drawing heavily on these assumptions.

Before we delve more deeply, let us elaborate a bit on the concept of statistical models and see how they fit into the prediction framework.

2.6 Statistical Models, Supervised Learning and Function Approximation

Our goal is to find a useful approximation $\hat{f}(x)$ to the function f(x) that underlies the predictive relationship between the inputs and outputs. In the theoretical setting of Section 2.4, we saw that squared error loss leads us to the regression function $f(x) = \mathrm{E}(Y \mid X = x)$ for a quantitative response. The class of nearest-neighbor methods can be viewed as direct estimates of this conditional expectation, but we have seen that they can fail in at least two ways:

• if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors;

• if special structure is known to exist, this can be used to reduce both the bias and the variance of the estimates.

We anticipate using other classes of models for f(x), in many cases specifically designed to overcome the dimensionality problems, and here we discuss a framework for incorporating them into the prediction problem.

2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)

Suppose in fact that our data arose from a statistical model

$$Y = f(X) + \varepsilon, \tag{2.29}$$

where the random error $\varepsilon$ has $\mathrm{E}(\varepsilon) = 0$ and is independent of X. Note that for this model, $f(x) = \mathrm{E}(Y \mid X = x)$, and in fact the conditional distribution Pr(Y|X) depends on X only through the conditional mean f(x).

The additive error model is a useful approximation to the truth. For most systems the input-output pairs (X, Y) will not have a deterministic relationship Y = f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error $\varepsilon$.

For some problems a deterministic relationship does hold. Many of the classification problems studied in machine learning are of this form, where the response surface can be thought of as a colored map defined in $\mathbb{R}^p$. The training data consist of colored examples from the map $\{x_i, g_i\}$, and the goal is to be able to color any point. Here the function is deterministic, and the randomness enters through the x location of the training points. For the moment we will not pursue such problems, but will see that they can be handled by techniques appropriate for the error-based models.

The assumption in (2.29) that the errors are independent and identically distributed is not strictly necessary, but seems to be at the back of our mind when we average squared errors uniformly in our EPE criterion.

With such a model it becomes natural to use least squares as a data criterion for model estimation as in (2.1). Simple modifications can be made to avoid the independence assumption; for example, we can have $\mathrm{Var}(Y \mid X = x) = \sigma(x)$, and now both the mean and variance depend on X. In general the conditional distribution Pr(Y|X) can depend on X in complicated ways, but the additive error model precludes these.

So far we have concentrated on the quantitative response. Additive error models are typically not used for qualitative outputs G; in this case the target function p(X) is the conditional density Pr(G|X), and this is modeled directly. For example, for two-class data, it is often reasonable to assume that the data arise from independent binary trials, with the probability of one particular outcome being p(X), and the other 1 − p(X). Thus if Y is the 0-1 coded version of G, then $\mathrm{E}(Y \mid X = x) = p(x)$, but the variance depends on x as well: $\mathrm{Var}(Y \mid X = x) = p(x)[1 - p(x)]$.

2.6.2 Supervised Learning

Before we launch into more statistically oriented jargon, we present the function-fitting paradigm from a machine learning point of view. Suppose for simplicity that the errors are additive and that the model $Y = f(X) + \varepsilon$ is a reasonable assumption. Supervised learning attempts to learn f by example through a teacher. One observes the system under study, both the inputs and outputs, and assembles a training set of observations $\mathcal{T} = (x_i, y_i)$, $i = 1, \ldots, N$. The observed input values to the system $x_i$ are also fed into an artificial system, known as a learning algorithm (usually a computer program), which also produces outputs $\hat{f}(x_i)$ in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship $\hat{f}$ in response to differences $y_i - \hat{f}(x_i)$ between the original and generated outputs. This process is known as learning by example. Upon completion of the learning process the hope is that the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice.

2.6.3 Function Approximation

The learning paradigm of the previous section has been the motivation for research into the supervised learning problem in the fields of machine learning (with analogies to human reasoning) and neural networks (with biological analogies to the brain). The approach taken in applied mathematics and statistics has been from the perspective of function approximation and estimation. Here the data pairs $\{x_i, y_i\}$ are viewed as points in a (p + 1)-dimensional Euclidean space. The function f(x) has domain equal to the p-dimensional input subspace, and is related to the data via a model such as $y_i = f(x_i) + \varepsilon_i$.

For convenience in this chapter we will assume the domain is $\mathbb{R}^p$, a p-dimensional Euclidean space, although in general the inputs can be of mixed type. The goal is to obtain a useful approximation to f(x) for all x in some region of $\mathbb{R}^p$, given the representations in $\mathcal{T}$. Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

Many of the approximations we will encounter have associated a set of parameters $\theta$ that can be modified to suit the data at hand. For example, the linear model $f(x) = x^T\beta$ has $\theta = \beta$. Another class of useful approximators can be expressed as linear basis expansions

$$f_\theta(x) = \sum_{k=1}^{K} h_k(x)\theta_k, \tag{2.30}$$

where the $h_k$ are a suitable set of functions or transformations of the input vector x. Traditional examples are polynomial and trigonometric expansions, where for example $h_k$ might be $x_1^2$, $x_1x_2^2$, $\cos(x_1)$ and so on. We also encounter nonlinear expansions, such as the sigmoid transformation common to neural network models,

$$h_k(x) = \frac{1}{1 + \exp(-x^T\beta_k)}. \tag{2.31}$$

We can use least squares to estimate the parameters $\theta$ in $f_\theta$ as we did for the linear model, by minimizing the residual sum-of-squares

$$\mathrm{RSS}(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2 \tag{2.32}$$

as a function of $\theta$. This seems a reasonable criterion for an additive error model. In terms of function approximation, we imagine our parameterized function as a surface in p + 1 space, and what we observe are noisy realizations from it. This is easy to visualize when p = 2 and the vertical coordinate is the output y, as in Figure 2.10. The noise is in the output coordinate, so we find the set of parameters such that the fitted surface gets as close to the observed points as possible, where close is measured by the sum of squared vertical errors in RSS(θ).

For the linear model we get a simple closed form solution to the minimization problem. This is also true for the basis function methods, if the basis functions themselves do not have any hidden parameters. Otherwise the solution requires either iterative methods or numerical optimization.

While least squares is generally very convenient, it is not the only criterion used and in some cases would not make much sense.
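When the basis functions $h_k$ have no hidden parameters, fitting (2.30) by least squares is just linear regression on the transformed features. A small sketch under that assumption (the particular basis and data here are arbitrary illustrations of mine, not the book's):

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy observations of an unknown smooth function of one input (illustrative).
x = rng.uniform(-2, 2, size=100)
y = np.sin(2 * x) + 0.3 * rng.normal(size=100)

def basis(x):
    """A fixed basis expansion h_k(x): constant, polynomial and trigonometric terms."""
    return np.column_stack([np.ones_like(x), x, x**2, np.cos(x), np.sin(2 * x)])

H = basis(x)
# Minimizing RSS(theta) in (2.32) over a fixed basis is ordinary least squares.
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

x_new = np.linspace(-2, 2, 5)
f_hat = basis(x_new) @ theta_hat      # fitted function values at new inputs
print(theta_hat, f_hat)
```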

[Figure 2.10 about here.]

FIGURE 2.10. Least squares fitting of a function of two inputs. The parameters of $f_\theta(x)$ are chosen so as to minimize the sum-of-squared vertical errors.

A more general principle for estimation is maximum likelihood estimation. Suppose we have a random sample $y_i$, $i = 1, \ldots, N$ from a density $\Pr_\theta(y)$ indexed by some parameters $\theta$. The log-probability of the observed sample is

$$L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i). \tag{2.33}$$

The principle of maximum likelihood assumes that the most reasonable values for $\theta$ are those for which the probability of the observed sample is largest. Least squares for the additive error model $Y = f_\theta(X) + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$, is equivalent to maximum likelihood using the conditional likelihood

$$\Pr(Y \mid X, \theta) = N(f_\theta(X), \sigma^2). \tag{2.34}$$

So although the additional assumption of normality seems more restrictive, the results are the same. The log-likelihood of the data is

$$L(\theta) = -\frac{N}{2}\log(2\pi) - N\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - f_\theta(x_i))^2, \tag{2.35}$$

and the only term involving $\theta$ is the last, which is RSS(θ) up to a scalar negative multiplier.
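To see the equivalence concretely, the Gaussian log-likelihood (2.35) can be evaluated for any candidate $\theta$; only its last term varies with $\theta$, and it equals $-\mathrm{RSS}(\theta)/(2\sigma^2)$. A small check of mine, reusing the basis, theta_hat, x and y from the earlier basis-expansion sketch and treating $\sigma$ as known purely for illustration:

```python
import numpy as np

def gaussian_log_likelihood(theta, sigma, x, y):
    """Log-likelihood (2.35) for the additive-error model with Gaussian noise."""
    resid = y - basis(x) @ theta              # basis() as defined in the previous sketch
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi) - n * np.log(sigma) - np.sum(resid**2) / (2 * sigma**2)

# The least-squares theta_hat maximizes this likelihood for any fixed sigma:
print(gaussian_log_likelihood(theta_hat, 0.3, x, y))
print(gaussian_log_likelihood(theta_hat + 0.1, 0.3, x, y))   # any perturbation scores lower
```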

A more interesting example is the multinomial likelihood for the regression function Pr(G|X) for a qualitative output G. Suppose we have a model $\Pr(G = \mathcal{G}_k \mid X = x) = p_{k,\theta}(x)$, $k = 1, \ldots, K$ for the conditional probability of each class given X, indexed by the parameter vector $\theta$. Then the log-likelihood (also referred to as the cross-entropy) is

$$L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i), \tag{2.36}$$

and when maximized it delivers values of $\theta$ that best conform with the data in this likelihood sense.

2.7 Structured Regression Models

We have seen that although nearest-neighbor and other local methods focus directly on estimating the function at a point, they face problems in high dimensions. They may also be inappropriate even in low dimensions in cases where more structured approaches can make more efficient use of the data. This section introduces classes of such structured approaches. Before we proceed, though, we discuss further the need for such classes.

2.7.1 Difficulty of the Problem

Consider the RSS criterion for an arbitrary function f,

$$\mathrm{RSS}(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2. \tag{2.37}$$

Minimizing (2.37) leads to infinitely many solutions: any function $\hat{f}$ passing through the training points $(x_i, y_i)$ is a solution. Any particular solution chosen might be a poor predictor at test points different from the training points. If there are multiple observation pairs $x_i, y_{i\ell}$, $\ell = 1, \ldots, N_i$ at each value of $x_i$, the risk is limited. In this case, the solutions pass through the average values of the $y_{i\ell}$ at each $x_i$; see Exercise 2.6. The situation is similar to the one we have already visited in Section 2.4; indeed, (2.37) is the finite sample version of (2.11) on page 18. If the sample size N were sufficiently large such that repeats were guaranteed and densely arranged, it would seem that these solutions might all tend to the limiting conditional expectation.

In order to obtain useful results for finite N, we must restrict the eligible solutions to (2.37) to a smaller set of functions. How to decide on the nature of the restrictions is based on considerations outside of the data. These restrictions are sometimes encoded via the parametric representation of $f_\theta$, or may be built into the learning method itself, either implicitly or explicitly. These restricted classes of solutions are the major topic of this book. One thing should be clear, though. Any restrictions imposed on f that lead to a unique solution to (2.37) do not really remove the ambiguity caused by the multiplicity of solutions.

There are infinitely many possible restrictions, each leading to a unique solution, so the ambiguity has simply been transferred to the choice of constraint.

In general the constraints imposed by most learning methods can be described as complexity restrictions of one kind or another. This usually means some kind of regular behavior in small neighborhoods of the input space. That is, for all input points x sufficiently close to each other in some metric, $\hat{f}$ exhibits some special structure such as nearly constant, linear or low-order polynomial behavior. The estimator is then obtained by averaging or polynomial fitting in that neighborhood.

The strength of the constraint is dictated by the neighborhood size. The larger the size of the neighborhood, the stronger the constraint, and the more sensitive the solution is to the particular choice of constraint. For example, local constant fits in infinitesimally small neighborhoods is no constraint at all; local linear fits in very large neighborhoods is almost a globally linear model, and is very restrictive.

The nature of the constraint depends on the metric used. Some methods, such as kernel and local regression and tree-based methods, directly specify the metric and size of the neighborhood. The nearest-neighbor methods discussed so far are based on the assumption that locally the function is constant; close to a target input $x_0$, the function does not change much, and so close outputs can be averaged to produce $\hat{f}(x_0)$. Other methods such as splines, neural networks and basis-function methods implicitly define neighborhoods of local behavior. In Section 5.4.1 we discuss the concept of an equivalent kernel (see Figure 5.8 on page 157), which describes this local dependence for any method linear in the outputs. These equivalent kernels in many cases look just like the explicitly defined weighting kernels discussed above: peaked at the target point and falling away smoothly away from it.

One fact should be clear by now. Any method that attempts to produce locally varying functions in small isotropic neighborhoods will run into problems in high dimensions: again the curse of dimensionality. And conversely, all methods that overcome the dimensionality problems have an associated, and often implicit or adaptive, metric for measuring neighborhoods, which basically does not allow the neighborhood to be simultaneously small in all directions.

2.8 Classes of Restricted Estimators

The variety of nonparametric regression techniques or learning methods fall into a number of different classes depending on the nature of the restrictions imposed. These classes are not distinct, and indeed some methods fall in several classes. Here we give a brief summary, since detailed descriptions are given in later chapters.

Each of the classes has associated with it one or more parameters, sometimes appropriately called smoothing parameters, that control the effective size of the local neighborhood. Here we describe three broad classes.

2.8.1 Roughness Penalty and Bayesian Methods

Here the class of functions is controlled by explicitly penalizing RSS(f) with a roughness penalty

$$\mathrm{PRSS}(f; \lambda) = \mathrm{RSS}(f) + \lambda J(f). \tag{2.38}$$

The user-selected functional J(f) will be large for functions f that vary too rapidly over small regions of input space. For example, the popular cubic smoothing spline for one-dimensional inputs is the solution to the penalized least-squares criterion

$$\mathrm{PRSS}(f; \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2\,dx. \tag{2.39}$$

The roughness penalty here controls large values of the second derivative of f, and the amount of penalty is dictated by $\lambda \ge 0$. For $\lambda = 0$ no penalty is imposed, and any interpolating function will do, while for $\lambda = \infty$ only functions linear in x are permitted.

Penalty functionals J can be constructed for functions in any dimension, and special versions can be created to impose special structure. For example, additive penalties $J(f) = \sum_{j=1}^{p} J(f_j)$ are used in conjunction with additive functions $f(X) = \sum_{j=1}^{p} f_j(X_j)$ to create additive models with smooth coordinate functions. Similarly, projection pursuit regression models have $f(X) = \sum_{m=1}^{M} g_m(\alpha_m^T X)$ for adaptively chosen directions $\alpha_m$, and the functions $g_m$ can each have an associated roughness penalty.

Penalty function, or regularization, methods express our prior belief that the type of functions we seek exhibit a certain type of smooth behavior, and indeed can usually be cast in a Bayesian framework. The penalty J corresponds to a log-prior, and PRSS(f; λ) the log-posterior distribution, and minimizing PRSS(f; λ) amounts to finding the posterior mode. We discuss roughness-penalty approaches in Chapter 5 and the Bayesian paradigm in Chapter 8.

2.8.2 Kernel Methods and Local Regression

These methods can be thought of as explicitly providing estimates of the regression function or conditional expectation by specifying the nature of the local neighborhood, and of the class of regular functions fitted locally. The local neighborhood is specified by a kernel function $K_\lambda(x_0, x)$ which assigns weights to points x in a region around $x_0$.
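Anticipating the later chapters, a kernel-weighted average is perhaps the simplest instance of this idea. The sketch below only illustrates the weighting scheme just described; the Gaussian kernel and the locally constant fit are my own choices for the example, not a prescription from the text:

```python
import numpy as np

def gaussian_kernel(x0, x, lam=0.5):
    """K_lambda(x0, x): a weight that decays smoothly with distance from x0."""
    return np.exp(-np.sum((x - x0) ** 2, axis=-1) / (2 * lam**2))

def local_constant_fit(x0, X_train, y_train, lam=0.5):
    """Kernel-weighted average of the responses around x0 (a locally constant fit)."""
    w = gaussian_kernel(x0, X_train, lam)
    return np.sum(w * y_train) / np.sum(w)

# Example call with the hypothetical training data from the earlier simulation sketch:
# print(local_constant_fit(np.array([0.5, 0.5]), X_train, y_train))
```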


More information

Building to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems.

Building to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems. Building t Transfrmatins n Crdinate Axis Grade 5: Gemetry Graph pints n the crdinate plane t slve real-wrld and mathematical prblems. 5.G.1. Use a pair f perpendicular number lines, called axes, t define

More information

Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter

Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter Midwest Big Data Summer Schl: Machine Learning I: Intrductin Kris De Brabanter kbrabant@iastate.edu Iwa State University Department f Statistics Department f Cmputer Science June 24, 2016 1/24 Outline

More information

Differentiation Applications 1: Related Rates

Differentiation Applications 1: Related Rates Differentiatin Applicatins 1: Related Rates 151 Differentiatin Applicatins 1: Related Rates Mdel 1: Sliding Ladder 10 ladder y 10 ladder 10 ladder A 10 ft ladder is leaning against a wall when the bttm

More information

7 TH GRADE MATH STANDARDS

7 TH GRADE MATH STANDARDS ALGEBRA STANDARDS Gal 1: Students will use the language f algebra t explre, describe, represent, and analyze number expressins and relatins 7 TH GRADE MATH STANDARDS 7.M.1.1: (Cmprehensin) Select, use,

More information

Support-Vector Machines

Support-Vector Machines Supprt-Vectr Machines Intrductin Supprt vectr machine is a linear machine with sme very nice prperties. Haykin chapter 6. See Alpaydin chapter 13 fr similar cntent. Nte: Part f this lecture drew material

More information

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression

4th Indian Institute of Astrophysics - PennState Astrostatistics School July, 2013 Vainu Bappu Observatory, Kavalur. Correlation and Regression 4th Indian Institute f Astrphysics - PennState Astrstatistics Schl July, 2013 Vainu Bappu Observatry, Kavalur Crrelatin and Regressin Rahul Ry Indian Statistical Institute, Delhi. Crrelatin Cnsider a tw

More information

This section is primarily focused on tools to aid us in finding roots/zeros/ -intercepts of polynomials. Essentially, our focus turns to solving.

This section is primarily focused on tools to aid us in finding roots/zeros/ -intercepts of polynomials. Essentially, our focus turns to solving. Sectin 3.2: Many f yu WILL need t watch the crrespnding vides fr this sectin n MyOpenMath! This sectin is primarily fcused n tls t aid us in finding rts/zers/ -intercepts f plynmials. Essentially, ur fcus

More information

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS 1 Influential bservatins are bservatins whse presence in the data can have a distrting effect n the parameter estimates and pssibly the entire analysis,

More information

Part 3 Introduction to statistical classification techniques

Part 3 Introduction to statistical classification techniques Part 3 Intrductin t statistical classificatin techniques Machine Learning, Part 3, March 07 Fabi Rli Preamble ØIn Part we have seen that if we knw: Psterir prbabilities P(ω i / ) Or the equivalent terms

More information

Determining the Accuracy of Modal Parameter Estimation Methods

Determining the Accuracy of Modal Parameter Estimation Methods Determining the Accuracy f Mdal Parameter Estimatin Methds by Michael Lee Ph.D., P.E. & Mar Richardsn Ph.D. Structural Measurement Systems Milpitas, CA Abstract The mst cmmn type f mdal testing system

More information

the results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must

the results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must M.E. Aggune, M.J. Dambrg, M.A. El-Sharkawi, R.J. Marks II and L.E. Atlas, "Dynamic and static security assessment f pwer systems using artificial neural netwrks", Prceedings f the NSF Wrkshp n Applicatins

More information

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction T-61.5060 Algrithmic methds fr data mining Slide set 6: dimensinality reductin reading assignment LRU bk: 11.1 11.3 PCA tutrial in mycurses (ptinal) ptinal: An Elementary Prf f a Therem f Jhnsn and Lindenstrauss,

More information

AP Statistics Notes Unit Two: The Normal Distributions

AP Statistics Notes Unit Two: The Normal Distributions AP Statistics Ntes Unit Tw: The Nrmal Distributins Syllabus Objectives: 1.5 The student will summarize distributins f data measuring the psitin using quartiles, percentiles, and standardized scres (z-scres).

More information

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards:

MODULE FOUR. This module addresses functions. SC Academic Elementary Algebra Standards: MODULE FOUR This mdule addresses functins SC Academic Standards: EA-3.1 Classify a relatinship as being either a functin r nt a functin when given data as a table, set f rdered pairs, r graph. EA-3.2 Use

More information

Module 4: General Formulation of Electric Circuit Theory

Module 4: General Formulation of Electric Circuit Theory Mdule 4: General Frmulatin f Electric Circuit Thery 4. General Frmulatin f Electric Circuit Thery All electrmagnetic phenmena are described at a fundamental level by Maxwell's equatins and the assciated

More information

Kinetic Model Completeness

Kinetic Model Completeness 5.68J/10.652J Spring 2003 Lecture Ntes Tuesday April 15, 2003 Kinetic Mdel Cmpleteness We say a chemical kinetic mdel is cmplete fr a particular reactin cnditin when it cntains all the species and reactins

More information

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw:

In SMV I. IAML: Support Vector Machines II. This Time. The SVM optimization problem. We saw: In SMV I IAML: Supprt Vectr Machines II Nigel Gddard Schl f Infrmatics Semester 1 We sa: Ma margin trick Gemetry f the margin and h t cmpute it Finding the ma margin hyperplane using a cnstrained ptimizatin

More information

Least Squares Optimal Filtering with Multirate Observations

Least Squares Optimal Filtering with Multirate Observations Prc. 36th Asilmar Cnf. n Signals, Systems, and Cmputers, Pacific Grve, CA, Nvember 2002 Least Squares Optimal Filtering with Multirate Observatins Charles W. herrien and Anthny H. Hawes Department f Electrical

More information

Lyapunov Stability Stability of Equilibrium Points

Lyapunov Stability Stability of Equilibrium Points Lyapunv Stability Stability f Equilibrium Pints 1. Stability f Equilibrium Pints - Definitins In this sectin we cnsider n-th rder nnlinear time varying cntinuus time (C) systems f the frm x = f ( t, x),

More information

CHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came.

CHAPTER 24: INFERENCE IN REGRESSION. Chapter 24: Make inferences about the population from which the sample data came. MATH 1342 Ch. 24 April 25 and 27, 2013 Page 1 f 5 CHAPTER 24: INFERENCE IN REGRESSION Chapters 4 and 5: Relatinships between tw quantitative variables. Be able t Make a graph (scatterplt) Summarize the

More information

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India

CHAPTER 3 INEQUALITIES. Copyright -The Institute of Chartered Accountants of India CHAPTER 3 INEQUALITIES Cpyright -The Institute f Chartered Accuntants f India INEQUALITIES LEARNING OBJECTIVES One f the widely used decisin making prblems, nwadays, is t decide n the ptimal mix f scarce

More information

ENSC Discrete Time Systems. Project Outline. Semester

ENSC Discrete Time Systems. Project Outline. Semester ENSC 49 - iscrete Time Systems Prject Outline Semester 006-1. Objectives The gal f the prject is t design a channel fading simulatr. Upn successful cmpletin f the prject, yu will reinfrce yur understanding

More information

Emphases in Common Core Standards for Mathematical Content Kindergarten High School

Emphases in Common Core Standards for Mathematical Content Kindergarten High School Emphases in Cmmn Cre Standards fr Mathematical Cntent Kindergarten High Schl Cntent Emphases by Cluster March 12, 2012 Describes cntent emphases in the standards at the cluster level fr each grade. These

More information

Stats Classification Ji Zhu, Michigan Statistics 1. Classification. Ji Zhu 445C West Hall

Stats Classification Ji Zhu, Michigan Statistics 1. Classification. Ji Zhu 445C West Hall Stats 415 - Classificatin Ji Zhu, Michigan Statistics 1 Classificatin Ji Zhu 445C West Hall 734-936-2577 jizhu@umich.edu Stats 415 - Classificatin Ji Zhu, Michigan Statistics 2 Examples f Classificatin

More information

B. Definition of an exponential

B. Definition of an exponential Expnents and Lgarithms Chapter IV - Expnents and Lgarithms A. Intrductin Starting with additin and defining the ntatins fr subtractin, multiplicatin and divisin, we discvered negative numbers and fractins.

More information

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9. Sectin 7 Mdel Assessment This sectin is based n Stck and Watsn s Chapter 9. Internal vs. external validity Internal validity refers t whether the analysis is valid fr the ppulatin and sample being studied.

More information

Linear Classification

Linear Classification Linear Classificatin CS 54: Machine Learning Slides adapted frm Lee Cper, Jydeep Ghsh, and Sham Kakade Review: Linear Regressin CS 54 [Spring 07] - H Regressin Given an input vectr x T = (x, x,, xp), we

More information

Math Foundations 10 Work Plan

Math Foundations 10 Work Plan Math Fundatins 10 Wrk Plan Units / Tpics 10.1 Demnstrate understanding f factrs f whle numbers by: Prime factrs Greatest Cmmn Factrs (GCF) Least Cmmn Multiple (LCM) Principal square rt Cube rt Time Frame

More information

Elements of Machine Intelligence - I

Elements of Machine Intelligence - I ECE-175A Elements f Machine Intelligence - I Ken Kreutz-Delgad Nun Vascncels ECE Department, UCSD Winter 2011 The curse The curse will cver basic, but imprtant, aspects f machine learning and pattern recgnitin

More information

NUMBERS, MATHEMATICS AND EQUATIONS

NUMBERS, MATHEMATICS AND EQUATIONS AUSTRALIAN CURRICULUM PHYSICS GETTING STARTED WITH PHYSICS NUMBERS, MATHEMATICS AND EQUATIONS An integral part t the understanding f ur physical wrld is the use f mathematical mdels which can be used t

More information

Statistical Learning. 2.1 What Is Statistical Learning?

Statistical Learning. 2.1 What Is Statistical Learning? 2 Statistical Learning 2.1 What Is Statistical Learning? In rder t mtivate ur study f statistical learning, we begin with a simple example. Suppse that we are statistical cnsultants hired by a client t

More information

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa There are tw parts t this lab. The first is intended t demnstrate hw t request and interpret the spatial diagnstics f a standard OLS regressin mdel using GeDa. The diagnstics prvide infrmatin abut the

More information

Preparation work for A2 Mathematics [2017]

Preparation work for A2 Mathematics [2017] Preparatin wrk fr A2 Mathematics [2017] The wrk studied in Y12 after the return frm study leave is frm the Cre 3 mdule f the A2 Mathematics curse. This wrk will nly be reviewed during Year 13, it will

More information

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis

SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical model for microarray data analysis SUPPLEMENTARY MATERIAL GaGa: a simple and flexible hierarchical mdel fr micrarray data analysis David Rssell Department f Bistatistics M.D. Andersn Cancer Center, Hustn, TX 77030, USA rsselldavid@gmail.cm

More information

The standards are taught in the following sequence.

The standards are taught in the following sequence. B L U E V A L L E Y D I S T R I C T C U R R I C U L U M MATHEMATICS Third Grade In grade 3, instructinal time shuld fcus n fur critical areas: (1) develping understanding f multiplicatin and divisin and

More information

We can see from the graph above that the intersection is, i.e., [ ).

We can see from the graph above that the intersection is, i.e., [ ). MTH 111 Cllege Algebra Lecture Ntes July 2, 2014 Functin Arithmetic: With nt t much difficulty, we ntice that inputs f functins are numbers, and utputs f functins are numbers. S whatever we can d with

More information

Eric Klein and Ning Sa

Eric Klein and Ning Sa Week 12. Statistical Appraches t Netwrks: p1 and p* Wasserman and Faust Chapter 15: Statistical Analysis f Single Relatinal Netwrks There are fur tasks in psitinal analysis: 1) Define Equivalence 2) Measure

More information

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint

Biplots in Practice MICHAEL GREENACRE. Professor of Statistics at the Pompeu Fabra University. Chapter 13 Offprint Biplts in Practice MICHAEL GREENACRE Prfessr f Statistics at the Pmpeu Fabra University Chapter 13 Offprint CASE STUDY BIOMEDICINE Cmparing Cancer Types Accrding t Gene Epressin Arrays First published:

More information

Contents. This is page i Printer: Opaque this

Contents. This is page i Printer: Opaque this Cntents This is page i Printer: Opaque this Supprt Vectr Machines and Flexible Discriminants. Intrductin............. The Supprt Vectr Classifier.... Cmputing the Supprt Vectr Classifier........ Mixture

More information

Preparation work for A2 Mathematics [2018]

Preparation work for A2 Mathematics [2018] Preparatin wrk fr A Mathematics [018] The wrk studied in Y1 will frm the fundatins n which will build upn in Year 13. It will nly be reviewed during Year 13, it will nt be retaught. This is t allw time

More information

[COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t o m a k e s u r e y o u a r e r e a d y )

[COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t o m a k e s u r e y o u a r e r e a d y ) (Abut the final) [COLLEGE ALGEBRA EXAM I REVIEW TOPICS] ( u s e t h i s t m a k e s u r e y u a r e r e a d y ) The department writes the final exam s I dn't really knw what's n it and I can't very well

More information

Medium Scale Integrated (MSI) devices [Sections 2.9 and 2.10]

Medium Scale Integrated (MSI) devices [Sections 2.9 and 2.10] EECS 270, Winter 2017, Lecture 3 Page 1 f 6 Medium Scale Integrated (MSI) devices [Sectins 2.9 and 2.10] As we ve seen, it s smetimes nt reasnable t d all the design wrk at the gate-level smetimes we just

More information

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007

CS 477/677 Analysis of Algorithms Fall 2007 Dr. George Bebis Course Project Due Date: 11/29/2007 CS 477/677 Analysis f Algrithms Fall 2007 Dr. Gerge Bebis Curse Prject Due Date: 11/29/2007 Part1: Cmparisn f Srting Algrithms (70% f the prject grade) The bjective f the first part f the assignment is

More information

Math Foundations 20 Work Plan

Math Foundations 20 Work Plan Math Fundatins 20 Wrk Plan Units / Tpics 20.8 Demnstrate understanding f systems f linear inequalities in tw variables. Time Frame December 1-3 weeks 6-10 Majr Learning Indicatrs Identify situatins relevant

More information

MATHEMATICS SYLLABUS SECONDARY 5th YEAR

MATHEMATICS SYLLABUS SECONDARY 5th YEAR Eurpean Schls Office f the Secretary-General Pedaggical Develpment Unit Ref. : 011-01-D-8-en- Orig. : EN MATHEMATICS SYLLABUS SECONDARY 5th YEAR 6 perid/week curse APPROVED BY THE JOINT TEACHING COMMITTEE

More information

Methods for Determination of Mean Speckle Size in Simulated Speckle Pattern

Methods for Determination of Mean Speckle Size in Simulated Speckle Pattern 0.478/msr-04-004 MEASUREMENT SCENCE REVEW, Vlume 4, N. 3, 04 Methds fr Determinatin f Mean Speckle Size in Simulated Speckle Pattern. Hamarvá, P. Šmíd, P. Hrváth, M. Hrabvský nstitute f Physics f the Academy

More information

THE LIFE OF AN OBJECT IT SYSTEMS

THE LIFE OF AN OBJECT IT SYSTEMS THE LIFE OF AN OBJECT IT SYSTEMS Persns, bjects, r cncepts frm the real wrld, which we mdel as bjects in the IT system, have "lives". Actually, they have tw lives; the riginal in the real wrld has a life,

More information

Computational modeling techniques

Computational modeling techniques Cmputatinal mdeling techniques Lecture 2: Mdeling change. In Petre Department f IT, Åb Akademi http://users.ab.fi/ipetre/cmpmd/ Cntent f the lecture Basic paradigm f mdeling change Examples Linear dynamical

More information

22.54 Neutron Interactions and Applications (Spring 2004) Chapter 11 (3/11/04) Neutron Diffusion

22.54 Neutron Interactions and Applications (Spring 2004) Chapter 11 (3/11/04) Neutron Diffusion .54 Neutrn Interactins and Applicatins (Spring 004) Chapter (3//04) Neutrn Diffusin References -- J. R. Lamarsh, Intrductin t Nuclear Reactr Thery (Addisn-Wesley, Reading, 966) T study neutrn diffusin

More information

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp

1996 Engineering Systems Design and Analysis Conference, Montpellier, France, July 1-4, 1996, Vol. 7, pp THE POWER AND LIMIT OF NEURAL NETWORKS T. Y. Lin Department f Mathematics and Cmputer Science San Jse State University San Jse, Califrnia 959-003 tylin@cs.ssu.edu and Bereley Initiative in Sft Cmputing*

More information

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA Mdelling f Clck Behaviur Dn Percival Applied Physics Labratry University f Washingtn Seattle, Washingtn, USA verheads and paper fr talk available at http://faculty.washingtn.edu/dbp/talks.html 1 Overview

More information

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers

LHS Mathematics Department Honors Pre-Calculus Final Exam 2002 Answers LHS Mathematics Department Hnrs Pre-alculus Final Eam nswers Part Shrt Prblems The table at the right gives the ppulatin f Massachusetts ver the past several decades Using an epnential mdel, predict the

More information

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs

Admissibility Conditions and Asymptotic Behavior of Strongly Regular Graphs Admissibility Cnditins and Asympttic Behavir f Strngly Regular Graphs VASCO MOÇO MANO Department f Mathematics University f Prt Oprt PORTUGAL vascmcman@gmailcm LUÍS ANTÓNIO DE ALMEIDA VIEIRA Department

More information

Weathering. Title: Chemical and Mechanical Weathering. Grade Level: Subject/Content: Earth and Space Science

Weathering. Title: Chemical and Mechanical Weathering. Grade Level: Subject/Content: Earth and Space Science Weathering Title: Chemical and Mechanical Weathering Grade Level: 9-12 Subject/Cntent: Earth and Space Science Summary f Lessn: Students will test hw chemical and mechanical weathering can affect a rck

More information

Inference in the Multiple-Regression

Inference in the Multiple-Regression Sectin 5 Mdel Inference in the Multiple-Regressin Kinds f hypthesis tests in a multiple regressin There are several distinct kinds f hypthesis tests we can run in a multiple regressin. Suppse that amng

More information

Physics 2B Chapter 23 Notes - Faraday s Law & Inductors Spring 2018

Physics 2B Chapter 23 Notes - Faraday s Law & Inductors Spring 2018 Michael Faraday lived in the Lndn area frm 1791 t 1867. He was 29 years ld when Hand Oersted, in 1820, accidentally discvered that electric current creates magnetic field. Thrugh empirical bservatin and

More information

Dead-beat controller design

Dead-beat controller design J. Hetthéssy, A. Barta, R. Bars: Dead beat cntrller design Nvember, 4 Dead-beat cntrller design In sampled data cntrl systems the cntrller is realised by an intelligent device, typically by a PLC (Prgrammable

More information

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation

A New Evaluation Measure. J. Joiner and L. Werner. The problems of evaluation and the needed criteria of evaluation III-l III. A New Evaluatin Measure J. Jiner and L. Werner Abstract The prblems f evaluatin and the needed criteria f evaluatin measures in the SMART system f infrmatin retrieval are reviewed and discussed.

More information

Subject description processes

Subject description processes Subject representatin 6.1.2. Subject descriptin prcesses Overview Fur majr prcesses r areas f practice fr representing subjects are classificatin, subject catalging, indexing, and abstracting. The prcesses

More information

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition The Kullback-Leibler Kernel as a Framewrk fr Discriminant and Lcalized Representatins fr Visual Recgnitin Nun Vascncels Purdy H Pedr Mren ECE Department University f Califrnia, San Dieg HP Labs Cambridge

More information

Computational modeling techniques

Computational modeling techniques Cmputatinal mdeling techniques Lecture 4: Mdel checing fr ODE mdels In Petre Department f IT, Åb Aademi http://www.users.ab.fi/ipetre/cmpmd/ Cntent Stichimetric matrix Calculating the mass cnservatin relatins

More information

Lecture 8: Multiclass Classification (I)

Lecture 8: Multiclass Classification (I) Bayes Rule fr Multiclass Prblems Traditinal Methds fr Multiclass Prblems Linear Regressin Mdels Lecture 8: Multiclass Classificatin (I) Ha Helen Zhang Fall 07 Ha Helen Zhang Lecture 8: Multiclass Classificatin

More information

Department of Electrical Engineering, University of Waterloo. Introduction

Department of Electrical Engineering, University of Waterloo. Introduction Sectin 4: Sequential Circuits Majr Tpics Types f sequential circuits Flip-flps Analysis f clcked sequential circuits Mre and Mealy machines Design f clcked sequential circuits State transitin design methd

More information

Lecture 3: Principal Components Analysis (PCA)

Lecture 3: Principal Components Analysis (PCA) Lecture 3: Principal Cmpnents Analysis (PCA) Reading: Sectins 6.3.1, 10.1, 10.2, 10.4 STATS 202: Data mining and analysis Jnathan Taylr, 9/28 Slide credits: Sergi Bacallad 1 / 24 The bias variance decmpsitin

More information

Comparing Several Means: ANOVA. Group Means and Grand Mean

Comparing Several Means: ANOVA. Group Means and Grand Mean STAT 511 ANOVA and Regressin 1 Cmparing Several Means: ANOVA Slide 1 Blue Lake snap beans were grwn in 12 pen-tp chambers which are subject t 4 treatments 3 each with O 3 and SO 2 present/absent. The ttal

More information

Relationship Between Amplifier Settling Time and Pole-Zero Placements for Second-Order Systems *

Relationship Between Amplifier Settling Time and Pole-Zero Placements for Second-Order Systems * Relatinship Between Amplifier Settling Time and Ple-Zer Placements fr Secnd-Order Systems * Mark E. Schlarmann and Randall L. Geiger Iwa State University Electrical and Cmputer Engineering Department Ames,

More information

Support Vector Machines and Flexible Discriminants

Support Vector Machines and Flexible Discriminants Supprt Vectr Machines and Flexible Discriminants This is page Printer: Opaque this. Intrductin In this chapter we describe generalizatins f linear decisin bundaries fr classificatin. Optimal separating

More information

1 The limitations of Hartree Fock approximation

1 The limitations of Hartree Fock approximation Chapter: Pst-Hartree Fck Methds - I The limitatins f Hartree Fck apprximatin The n electrn single determinant Hartree Fck wave functin is the variatinal best amng all pssible n electrn single determinants

More information

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA

February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA February 28, 2013 COMMENTS ON DIFFUSION, DIFFUSIVITY AND DERIVATION OF HYPERBOLIC EQUATIONS DESCRIBING THE DIFFUSION PHENOMENA Mental Experiment regarding 1D randm walk Cnsider a cntainer f gas in thermal

More information

Localized Model Selection for Regression

Localized Model Selection for Regression Lcalized Mdel Selectin fr Regressin Yuhng Yang Schl f Statistics University f Minnesta Church Street S.E. Minneaplis, MN 5555 May 7, 007 Abstract Research n mdel/prcedure selectin has fcused n selecting

More information