ECE 592 Topics in Data Science
1 ECE 592 Topics in Data Science Dror Baron Associate Professor Dept. of Electrical and Computer Engr. North Carolina State University, NC, USA
2 Two Classification Examples [Hastie et al., Chapter ] Keywords: classification, least squares, linear models, nearest neighbors
3 Classification problem Simple binary classification problem (code online). Training samples in 2-dimensional space (200 samples). Each sample labeled blue/red (the book uses green/red). Not (yet!) revealing how the samples were generated. Goal: for test data, predict the class (blue/red).
4 Is the goal realistic? Structure of the classes not clear; some red/blue samples almost overlap. We want two conflicting goals: a stable classification method (doesn't jump around a lot) and a low-bias method (usually accurate).
5 Nearest neighbors approach For each test point, find the K nearest neighbors among the training samples. The classes (blue/red) of the K nearest neighbors are known; classify using a majority vote. How big should K be?
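As a rough illustration of the majority-vote rule above, here is a minimal numpy sketch; the course code is online, so this is only an assumed stand-in, and the toy data and labeling rule are placeholders rather than the dataset from the slides.

```python
import numpy as np

def knn_classify(X_train, y_train, x_test, K):
    """Classify one test point by majority vote among its K nearest training samples."""
    # Euclidean distances from the test point to every training sample
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the K closest training samples
    nearest = np.argsort(dists)[:K]
    # Majority vote over their labels (labels assumed coded blue=0, red=1)
    votes = y_train[nearest]
    return int(np.round(votes.mean()))

# Toy usage with made-up 2-D data (200 samples, as in the slides)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # placeholder labeling rule
print(knn_classify(X_train, y_train, np.array([0.5, 0.5]), K=5))
```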
6 Nearest neighbors results (K=50) Begin with large K=50 (recall 200 training samples). Each test sample is classified using lots of neighbors. Very stable (large areas of red and blue in the classifier output). Also heavily biased (fails to adapt to small patches of the other color).
7 Nearest neighbors results (K=5) Smaller K=5. Less stable (areas of blue and red change quickly). Seems to classify more precisely. Maybe we should use the minimal K=1?
8 Nearest neighbor (K=1) Unstable (often have singleton samples). Classifies the training data well, but maybe prone to noise? A moderate K offers the best stability/bias trade-off.
9 How can we choose K automatically? The optimal value of K depends on several factors. More randomness → need a greater K for a confident majority vote. High-dimensional data affects neighborliness (known as the curse of dimensionality). Large N → lots of computation to find nearest neighbors. Can select a good K with a validation approach, as sketched below: partition into training/test data, then validate quality on the test data after training with different K.
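The validation idea might look like the sketch below (assuming the knn_classify helper from the earlier sketch); the split fraction and candidate K values are arbitrary choices, not prescriptions from the slides.

```python
import numpy as np

def choose_K_by_validation(X, y, candidate_Ks, val_fraction=0.3, seed=0):
    """Hold out a validation split, score each candidate K, return the best one."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    best_K, best_acc = None, -1.0
    for K in candidate_Ks:
        preds = [knn_classify(X[train_idx], y[train_idx], x, K) for x in X[val_idx]]
        acc = np.mean(np.array(preds) == y[val_idx])
        if acc > best_acc:
            best_K, best_acc = K, acc
    return best_K

# Example: search over a few values of K
# print(choose_K_by_validation(X_train, y_train, candidate_Ks=[1, 5, 15, 50]))
```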
10 Linear model Consider a model for the data: Ŷ = β_0 + Σ_{j=1}^p X_j β_j, with # dimensions p=2 in our example. X_j is the coordinate along dimension j, β_j is the estimated linear coefficient, and the first coefficient β_0 corresponds to the intercept or bias. Why is the linear model commonly used? Relatively simple (tractable); often works well; many systems are well-approximated as linear → lots of machinery developed for this.
11 How do we compute coefficients? Let's simplify notation: Ŷ = X^T β; interpret it as the linear function f(X) = X^T β. Least squares method [Gauss, 1795]: minimize the sum of squared errors Error(β) = Σ_{n=1}^N (y_n − x_n^T β)². In vector form, Error(β) = (y − Xβ)^T (y − Xβ), where X is an N×p matrix and y is the N-vector of training classes. To minimize Error(β), take the derivative with respect to (w.r.t.) β: X^T (y − Xβ) = 0 (zero derivative), so X^T y = X^T X β. Solution (pseudo-inverse): β̂ = (X^T X)^{-1} X^T y.
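A small numpy sketch of this solution with made-up data: np.linalg.lstsq solves the same least-squares problem, and the explicit (X^T X)^{-1} X^T y formula from the slide is computed only as a sanity check.

```python
import numpy as np

def least_squares(X, y):
    """Least-squares fit; lstsq is the numerically safer route to (X^T X)^{-1} X^T y."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Toy check against the slide's explicit pseudo-inverse formula
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X = np.column_stack([np.ones(200), X])          # prepend a column of ones for the intercept beta_0
y = X @ np.array([0.5, 1.0, -2.0]) + 0.1 * rng.normal(size=200)
beta_hat = least_squares(X, y)
beta_pinv = np.linalg.inv(X.T @ X) @ X.T @ y    # (X^T X)^{-1} X^T y
print(np.allclose(beta_hat, beta_pinv))
```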
12 Linear classifier results We have two classification regions, red/blue. A straight line separates them. Very stable; terrible bias.
13 Discussion Linear model: very stable and biased. Nearest neighbors: less stable and less biased; needs a moderate value of K. Could interpret nearest neighbors as better, but it suffers from the curse of dimensionality (scarcity of neighbors in high dimensions).
14 Decision Theory [Hastie et al., Section 2.4] Keywords: decision theory
15 Decision theory More general problem: minimize a loss function. Squared-error loss: we've seen Error(β) = Σ_{n=1}^N (y_n − x_n^T β)², which emphasizes/accentuates larger errors; in the Bayesian case (known statistics) the optimal estimator is the conditional expectation Ŷ = E[Y|X]. Absolute-error loss: in our example, Error(β) = Σ_{n=1}^N |y_n − x_n^T β|, so smaller errors receive relatively more weight; leads to the median estimator, typically viewed as more robust. Bayesian binary classification selects the most probable class.
16 Example Consider Y = {2, 1, 1.5, 2, 17}. We will show for Error(β) = ‖Y − β‖₂² that the average, β = 4.7, minimizes the error (details in supplements). We will show for Error(β) = ‖Y − β‖₁ that the median, β = 2, minimizes the error (details in supplements).
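A quick numerical check of this claim (not the supplement's derivation): evaluate both losses over a grid of candidate β and confirm the minimizers are the mean and the median.

```python
import numpy as np

Y = np.array([2.0, 1.0, 1.5, 2.0, 17.0])
betas = np.linspace(0, 20, 2001)                       # candidate constant estimates
sq_err = [np.sum((Y - b) ** 2) for b in betas]         # squared-error loss
abs_err = [np.sum(np.abs(Y - b)) for b in betas]       # absolute-error loss

print(betas[np.argmin(sq_err)], Y.mean())              # both approximately 4.7
print(betas[np.argmin(abs_err)], np.median(Y))         # both 2.0
```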
17 Clustering (Related to Project 1) [Hastie et al., Section 14.3] Keywords: clustering, unsupervised learning
18 Goals Want to group data into clusters that seem related. Central notion: degree of similarity between different clusters. The typical algorithmic approach is iterative: move points between clusters, then recalculate the cluster centers.
19 K means algorithm Initialize K cluster centers: select K points among the training data. Iterate until convergence: associate each training datum with the nearest cluster center, then recompute the cluster centers as the average of the training data in each cluster. Sensitive to initialization (can get stuck in a local optimum). Other clustering algorithms use a model for each cluster.
20 More about K means Map datum x_n in cluster C(n)=k to representation level r_k, k = k(n) = C(n). Squared error between x_n and r_k: d(x_n, r_k) = Σ_{p=1}^P (x_{np} − r_{kp})² = ‖x_n − r_k‖². Want r_k = argmin_{r ∈ R^P} Σ_{n: C(n)=k} ‖x_n − r‖²; the minimizer is the cluster average, r_k = (1/|{n: C(n)=k}|) Σ_{n: C(n)=k} x_n. Summed squared error for mapping C: Error(C) = Σ_{n=1}^N ‖x_n − r_{k(n)}‖² (sum over N data) = Σ_{k=1}^K Σ_{n: C(n)=k} ‖x_n − r_k‖² (sum over K clusters).
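A minimal K-means sketch following the two steps above; the initialization (K random training points) and the convergence test are simple placeholder choices.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate assignment to the nearest center and center recomputation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # initialize with K training points
    for _ in range(n_iters):
        # Assignment step: nearest center for every datum
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster (keep old center if empty)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# labels, centers = kmeans(data, K=3)   # data: any (N, p) array
```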
21 Linear Regression [Hastie et al., Chapter 3] Keywords: linear models, regression
22 Recall linear model Linear model for the data: Ŷ = β_0 + Σ_{j=1}^p X_j β_j. The linear model is tractable, often works well, and many systems are approximately linear. Favorable feedback loop between math/algorithms for linear models and hardware with linear properties.
23 Meaning of X? In Ŷ = β_0 + Σ_{j=1}^p X_j β_j, the variables X_j could be: 1. actual inputs (example: data about patients); 2. functions of inputs (example: log, square root); 3. polynomials of inputs (Taylor series approximation); 4. codes of classes (example: red=1, green=2, blue=3); 5. interactions between variables (example: X_3 = X_1·X_2). The function f(X) = β_0 + Σ_{j=1}^p X_j β_j could therefore be nonlinear in the underlying factors.
24 Classical solution Least squares method [Gauss, 1795]: minimize the sum of squared errors Error(β) = Σ_{n=1}^N (y_n − x_n^T β)². Closed-form solution: β̂ = (X^T X)^{-1} X^T y, with pseudo-inverse X^+ = (X^T X)^{-1} X^T. Geometric interpretation: interpret the inputs as a matrix X = [1, X_1, X_2, ..., X_p]; consider the vector space spanned by the columns of X; project Y onto that space, forming Ŷ = X β̂ = X X^+ Y; the residual Y − Ŷ is orthogonal to the space.
25 Real world example Consider financial prediction. Y_n = future % change in the price of stock n ∈ {1,...,N}. X_n1 = earnings / price (high is good; want high earnings/profits). X_n2 = book / price (book value = value of the company on paper; high is good). X_n3 = % price change over the last 6 months (price momentum: stocks that went up keep going up, on average). Can have (many) more variables X. Goal: predict future returns (% price change) from the data.
26 Example continued How to predict future returns? Form a linear model for the returns: Ŷ_n = β_0 + Σ_{j=1}^p X_{nj} β_j. Compute the parameters β (can use least squares). Apply the linear model with these parameters to predict (future) returns.
27 Multivariate and Univariate Regression [Hastie et al., Chapter 3.3] Keywords: multivariate linear regression, orthogonalization, univariate linear regression
28 Start with univariate Univariate linear model, Y = Xβ + ε; keep it simple, no intercept. Will show β̂ = (Σ_{n=1}^N x_n y_n) / (Σ_{n=1}^N x_n²) (details in supplements). To do so, use vector notation: x = vector of the x_n, y = vector of the y_n, inner product <x,y> = Σ_{n=1}^N x_n y_n, so β̂ = <x,y>/<x,x>. Can compute the residual (error), r = y − x β̂.
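A tiny sketch of the inner-product formula β̂ = <x,y>/<x,x> and the residual, with made-up numbers; the residual comes out (numerically) orthogonal to x.

```python
import numpy as np

def univariate_fit(x, y):
    """No-intercept univariate least squares: beta = <x,y> / <x,x>."""
    beta = np.dot(x, y) / np.dot(x, x)
    residual = y - x * beta
    return beta, residual

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta, r = univariate_fit(x, y)
print(beta, np.dot(x, r))   # <x, r> is (numerically) zero
```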
29 From univariate to multivariate Suppose the columns x_1, ..., x_p are orthogonal; orthogonal means <x_j, x_k> = 0 for j ≠ k. Can show β̂_j = <x_j, y>/<x_j, x_j> (for the orthogonal case). Challenge: real data isn't orthogonal. Solution: orthogonalize it (e.g., the Gram-Schmidt procedure), as sketched below. But orthogonalization can be numerically sensitive.
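One possible sketch of the orthogonalization idea, using classical Gram-Schmidt (the numerically sensitive variant mentioned above). Only the coefficient on the last orthogonalized column equals the corresponding multiple-regression coefficient; the other coefficients are with respect to the orthogonalized columns, not the original ones.

```python
import numpy as np

def regress_by_orthogonalization(X, y):
    """Gram-Schmidt the columns of X, then fit each coefficient as <z_j, y>/<z_j, z_j>."""
    N, p = X.shape
    Z = np.zeros((N, p))
    for j in range(p):
        z = X[:, j].astype(float)
        for k in range(j):
            # Subtract the projection of column j onto each previously orthogonalized column
            z = z - (np.dot(Z[:, k], X[:, j]) / np.dot(Z[:, k], Z[:, k])) * Z[:, k]
        Z[:, j] = z
    # Univariate fits on the orthogonalized columns
    gammas = np.array([np.dot(Z[:, j], y) / np.dot(Z[:, j], Z[:, j]) for j in range(p)])
    return Z, gammas
```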
30 Subset Selection [Hastie et al., Chapter 3.4] Keywords: feature selection, subset selection
31 Some problems with least squares Least squares often outputs an entirely nonzero vector. A dense (non-sparse) output corresponds to a complicated linear model; it is more convenient/insightful for a human operator to interpret a simpler model (humans are comfortable with 5-10 variables, not 500), so a slight loss of accuracy with a sparse output may be preferred. Least squares also doesn't use possible knowledge about the structure of β. Example structures: sparse β, discrete-valued β, non-negative β, ... Least squares (multiplication by the pseudo-inverse) ignores these.
32 Example where least squares fails Consider N users, each with a phone. Most users are inactive; K << N active users transmit {-1,+1}. The communication channel is often modeled as linear: Y = HX + noise, with X comprised of {-1, 0 (inactive), +1}. Least squares multiplies Y by the pseudo-inverse H^+, and H^+ Y contains elements not in {-1,0,+1}; but we want the output to be in {-1,0,+1}.
33 Subset selection Want sparse β → select a subset of the columns of the matrix X that approximates Y well. Called subset selection. Requires combinatorial complexity in general. Various techniques provide faster runtime; they may approach the best solution but not find it. Will study fast algorithms for this, but they require the elements of X to be independent and identically distributed (i.i.d.).
34 Another real world example Medical setting. Columns of X are measurements relating to patients; examples: height, blood pressure, heart problems in the family, ... Y is the medical outcome we're trying to predict; examples: blood pressure, sugar level, future treatment costs, ... Want to understand relations between conditions, possibly saving $. Lots of columns (dozens, hundreds, ...) make it challenging for a physician to comprehend the linear model; a subset-selection model based on a few terms is simpler for a human to work with.
35 Shrinkage [Hastie et al., Chapter 3.4.3] Keywords: LASSO, ridge regression, shrinkage
36 Shrinkage and ridge regression Subset selection throws out some columns entirely → may have large errors. Shrinkage methods modify β more subtly: they shrink elements of β toward zero. Ridge regression penalizes large β: β̂_ridge = argmin_β ‖y − Xβ‖² + λ‖β‖². Can be implemented with a modified pseudo-inverse, β̂_ridge = (X^T X + λI)^{-1} X^T y, where I is the identity matrix; the new penalty term is an ℓ2 norm.
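A short sketch of the modified pseudo-inverse above, with toy data to show the coefficients shrinking toward zero as λ grows.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution beta = (X^T X + lam*I)^{-1} X^T y (the slide's modified pseudo-inverse)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# As lam grows, the coefficients shrink toward zero
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=100)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 3))
```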
37 LASSO Least absolute shrinkage and selection operator (LASSO): β̂_lasso = argmin_β ‖y − Xβ‖² s.t. ‖β‖₁ ≤ t. Can express with a Lagrange multiplier λ: argmin_β ‖y − Xβ‖² + λ‖β‖₁. Small t drives many coefficients to zero. Various fast algorithms implement the LASSO: gradient projection for sparse reconstruction (GPSR) [Nowak et al.], approximate message passing (AMP) [Donoho et al.].
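The sketch below is not GPSR or AMP; it is a plain proximal-gradient (iterative soft-thresholding) pass at the Lagrangian form argmin_β ‖y − Xβ‖² + λ‖β‖₁, included only to make the shrinkage/soft-thresholding behavior concrete.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2          # squared spectral norm; the gradient is 2L-Lipschitz
    step = 1.0 / (2 * L)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ beta - y)    # gradient of the squared-error term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```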
38 Example: Estimating Sparse Inputs [Sarvotham, B, & Baraniuk, Sudocodes - Fast Measurement and Reconstruction of Sparse Signals, Int. Symp. Info. Theory, 2006] Keywords: sudocodes
39 How can we estimate sparse β? The LASSO can drive many coefficients to zero; is that reasonable? Let's revisit the linear problem, Y = Xβ, and assume: sparse β (mostly zeros; will justify later in the course), real-valued β, no noise. Linear algebra 101 says X must contain at least as many observations as unknowns. Incorrect when β contains structure (e.g., sparsity)!
40 Example [figure: linear system Y = Xβ with the entries of β marked '?']
41 Example What does a zero measurement imply? Hint: sparse and real-valued β [figure: Y = Xβ, entries of β still marked '?']
42 Example Graph reduction! [figure: β with some entries now shown as 0, others still '?']
43 Example What do matching measurements imply? Hint: non-zeros in x are real numbers [figure: β with some entries 0, others still '?']
44 Example What is the last entry of x? [figure: one remaining '?' entry]
45 And more Resembles solving Sudoku puzzles, BUT: sparse β, no noise, binary matrix X. Can extend to noisy data, for example by evaluating almost-matching measurements; details in [Ma, B, & Needell, Two-Part Reconstruction with Noisy-Sudocodes, IEEE Trans. Signal Proc., 2014]. Will study more powerful algorithms using sparse signal processing.
46 Linear Methods for Classification [Hastie et al., Chapter 4] Keywords: decision boundaries, linear discriminant analysis, logistic regression
47 Classification problem Data X has a class label G ∈ {1,...,K}; want to predict G(X). Main idea: create affine functions f_k(X) = β_{k0} + β_k^T X, k ∈ {1,...,K} (affine = linear but need not go through the origin; note that β depends on k). Predict Ĝ(X) = argmax_{k ∈ {1,...,K}} f_k(X). Decision boundary between classes k and l: f_k(X) = f_l(X).
48 Illustration [Hastie et al. Fig. 4.1] Data from K=3 classes illustrated by colored numbers. Linear boundaries: computed by linear discriminant analysis (LDA); LDA details later. Quadratic boundaries: add cross terms (X_1 X_2, X_1², X_2²) to the space, then run LDA.
49 Linear regression of indicator matrix Define K indicator variables: Y_k(X) = 1 if G(X) = k, 0 else, for k ∈ {1,...,K}. Form the K variables into a vector, Y = (Y_1,...,Y_K); the N training instances form an N×K indicator matrix. Train linear regression on the indicator matrix: Ŷ = X X^+ Y, with pseudo-inverse X^+ = (X^T X)^{-1} X^T. The first column of X contains ones (bias).
50 How is this useful? Classify: Ĝ(X) = argmax_{k ∈ {1,...,K}} f_k(X), where each f_k(X) comes from the linear regression. Interpretation: f_k(X) resembles Pr(G(X)=k | X), because the expectation of an indicator is a probability. In general, as N increases we can add more basis expansions (e.g., polynomial terms) and run the regression on those.
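A compact sketch of the indicator-matrix recipe: build the N×K indicator matrix, regress it on X with a leading column of ones, then classify by argmax. Class labels are assumed coded 0,...,K−1 here, which differs from the 1,...,K convention on the slides.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Regress an N x K indicator matrix on X (with a column of ones prepended)."""
    N = len(g)
    Xb = np.column_stack([np.ones(N), X])                  # first column of ones (bias)
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                               # indicator of class membership
    B = np.linalg.lstsq(Xb, Y, rcond=None)[0]              # coefficients B = X^+ Y
    return B

def predict_indicator_regression(B, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    scores = Xb @ B                                        # f_k(X) for every class k
    return scores.argmax(axis=1)                           # G-hat(X) = argmax_k f_k(X)
```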
51 Limitation Classes can be masked by others. Here we have K=3 classes, but only 2 are recognized. In general, an order-(K−1) polynomial corrects this.
52 Linear Discriminant Analysis [Hastie et al., Section 4.3] Keywords: linear discriminant analysis, quadratic discriminant analysis
53 Bayesian approach to classification Consider the class-conditional densities f_k(X) = f(X|k). Class k has prior probability π_k ≥ 0 (Σ_k π_k = 1). Bayes' theorem: Pr(k|X) = π_k f(X|k) / Σ_{k'=1}^K π_{k'} f(X|k') = π_k f_k(X) / Σ_{k'=1}^K π_{k'} f_{k'}(X) = Pr(X,k) / Σ_{k'=1}^K Pr(X,k'). Need a statistical model for the class-conditional densities.
54 Example (Question 3, practice midterm 2016) Two classes with pdfs f_blue = 0.5·N(−1,1) + 0.5·N(2,1) and f_red = N(0,1); Pr(blue) = Pr(red) = 0.5. The means of the Gaussian components differ; the variances of all Gaussian components are 1; for the blue class, 0.5 is the probability of each Gaussian component. Let's derive the Bayesian classification approach. Will be worked out; detailed solution in the supplement.
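A numerical sketch of the Bayes rule for this example (the analytical derivation is in the supplement): evaluate both class-conditional densities on a grid, form the posterior for blue, and read off approximately where it crosses 1/2.

```python
import numpy as np

def normal_pdf(x, mu, var=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.linspace(-5, 6, 1101)
f_blue = 0.5 * normal_pdf(x, -1.0) + 0.5 * normal_pdf(x, 2.0)   # mixture class-conditional
f_red = normal_pdf(x, 0.0)
post_blue = 0.5 * f_blue / (0.5 * f_blue + 0.5 * f_red)          # Bayes: Pr(blue | x)

decide_blue = post_blue > 0.5                                    # most probable class at each x
crossings = np.flatnonzero(np.diff(decide_blue.astype(int)) != 0)
print(x[crossings])                                              # approximate decision thresholds
```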
55 What density model to use? IF we know the true f_k(X) = f(X|k), we're done; a BIG if. Possible models: Gaussian densities (linear discriminant analysis); Gaussian mixtures (nonlinear decision boundaries); nonparametric densities; naïve Bayes (separable pdf).
56 Multivariate Gaussian density Choose the multivariate Gaussian density f_k(X) = (2π)^{−p/2} |Σ_k|^{−1/2} exp(−½ (X − μ_k)^T Σ_k^{−1} (X − μ_k)), with cluster mean μ_k and covariance matrix Σ_k; entry ij of Σ_k captures the covariance between elements i and j. Assume the same covariance matrix for all classes, Σ_k = Σ. Will revisit this assumption soon.
57 Interpretation Each class has a Gaussian centered around μ_k. Elliptical level sets (contours) for each Gaussian component; the structure of the ellipse depends on Σ.
58 Linear discriminant analysis (LDA) Can show log[Pr(G=k|X)/Pr(G=l|X)] = log(π_k/π_l) − ½ (μ_k + μ_l)^T Σ^{−1} (μ_k − μ_l) + x^T Σ^{−1} (μ_k − μ_l), which is linear in x → linear decision boundary. Classification: linear discriminant function δ_k(X) = log π_k + x^T Σ^{−1} μ_k − ½ μ_k^T Σ^{−1} μ_k; choose the k that maximizes δ_k(X). In practice, we will learn Σ, μ_k, π_k. Closely related to the linear regression approach.
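A minimal sketch of LDA as described above: estimate π_k, μ_k, and a pooled covariance Σ from labeled data, then classify by maximizing δ_k(x). Labels are assumed coded 0,...,K−1.

```python
import numpy as np

def fit_lda(X, g, K):
    """Estimate priors, class means, and a pooled covariance; return a discriminant classifier."""
    N, p = X.shape
    pis = np.array([np.mean(g == k) for k in range(K)])
    mus = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((p, p))
    for k in range(K):                                       # pooled within-class covariance
        D = X[g == k] - mus[k]
        Sigma += D.T @ D
    Sigma /= (N - K)
    Sinv = np.linalg.inv(Sigma)

    def delta(x):
        # delta_k(x) = log pi_k + x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k
        return np.array([np.log(pis[k]) + x @ Sinv @ mus[k] - 0.5 * mus[k] @ Sinv @ mus[k]
                         for k in range(K)])

    return lambda x: int(np.argmax(delta(x)))
```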
59 Quadratic discriminant analysis (QDA) What if the covariance matrices Σ_k are different? Each class has a different elliptical level set, and the decision boundaries become quadratic. QDA requires more parameters than LDA → needs a somewhat larger N to work well. In general, LDA and QDA have good performance: data seldom support complicated decision rules, and LDA and QDA yield robust/stable boundaries.
60 Logistic Regression [Hastie et al., Section 4.4] Keywords: logistic regression
61 Motivation Want to model the posteriors linearly: log[Pr(G=l|X)/Pr(G=K|X)] = β_{l0} + β_l^T X, for classes l = 1,...,K−1. Can show: Pr(G=l|X) = exp(β_{l0} + β_l^T X) / (1 + Σ_{m=1}^{K−1} exp(β_{m0} + β_m^T X)) for l = 1,...,K−1, and Pr(G=K|X) = 1 / (1 + Σ_{m=1}^{K−1} exp(β_{m0} + β_m^T X)).
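A small sketch that just evaluates these posterior formulas for given (assumed, made-up) parameters; it does not fit the parameters, which is the hard part discussed next.

```python
import numpy as np

def logistic_posteriors(x, B0, B):
    """Posteriors from the slide's formulas.
    B0: length K-1 intercepts beta_{l0}; B: (K-1, p) slopes beta_l; class K is the reference."""
    scores = np.exp(B0 + B @ x)                       # exp(beta_{l0} + beta_l^T x), l = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)     # last entry is Pr(G = K | x)

# Toy usage: K = 3 classes, p = 2 features, made-up parameters (sums to 1)
print(logistic_posteriors(np.array([0.5, -1.0]),
                          B0=np.array([0.1, -0.2]),
                          B=np.array([[1.0, 0.0], [0.0, 1.0]])))
```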
62 What's the hard part? The hard part is learning the parameters. The book describes some details; left for a possible final project.
63 Basis Expansions [Hastie et al., Section ] Keywords: basis expansion, regularization, splines, Taylor approximation
64 Why linear models? So far we have discussed linear models quite a bit. Linear = first-order Taylor approximation. Hard to fit more complex functions when N is small or P is large. But real-world functions/data aren't linear.
65 Augmented models Begin with the vector of inputs (X_1,...,X_P). Augment/replace the inputs with transformations of X; each transformation is h_m(X): R^P → R. Linear basis expansion: f(X) = Σ_{m=1}^M β_m h_m(X). Will fit the parameters β_m.
66 Example basis expansions 1. h_m(X) = X_m: the original linear model. 2. h_m(X) = (X_j)² or X_j X_k: second (or higher) order Taylor; beware growth in the number of terms (need to regularize). 3. Piecewise polynomials/splines: control complexity using variable selection or regularization.
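A one-dimensional polynomial-basis sketch of the idea: build h_m(x) = x^m, then fit the β_m by least squares on a toy nonlinear target. The degree and data are arbitrary placeholders.

```python
import numpy as np

def polynomial_basis(x, degree):
    """h_m(x) = x^m for m = 0..degree (a simple one-dimensional basis expansion)."""
    return np.column_stack([x ** m for m in range(degree + 1)])

# Fit f(x) = sum_m beta_m h_m(x) by least squares on a noisy nonlinear target
rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 100)
y = np.sin(2 * x) + 0.1 * rng.normal(size=100)
H = polynomial_basis(x, degree=5)
beta = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.round(beta, 3))
```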
67 Why piecewise polynomials? A global fit fluctuates wildly; piecewise fits capture local structure. Enforce continuity; the constraints reduce the degrees of freedom.
68 Common approach: Cubic splines Fit well within each region, but fluctuate even more elsewhere. Also have multi-dimensional splines, wavelets.
69 Real world application Recall the financial example: X contains financial data (prices, firm profits, sales, ...). A linear model for future prices may be restrictive; augmenting X with basis functions possibly gives a better fit.
70 Kernel Methods [Hastie et al., Section ] Keywords: kernel method
71 Recall nearest neighbors f̂(X) = Average(Y_i | X_i ∈ N_K(X)), where N_K(X) is the set of the K nearest neighbors. Yellow rectangle: its height represents the weights (1/K). As points leave/enter N_K(X), the NN estimator is bumpy.
72 Key idea Maybe give different weights to the X_i ∈ N(X)? Assign greater confidence (weight) to nearer points, tapering away for distant points: f̂(X_0) = Σ_{n=1}^N K_λ(X_0, X_n) Y_n / Σ_{n=1}^N K_λ(X_0, X_n), with kernel K_λ(X_0, X) = D(|X − X_0| / λ). The width parameter λ can be chosen automatically.
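A sketch of this weighted average with a Gaussian choice of D (the slides do not fix the kernel shape, so that is an assumption), on made-up one-dimensional data.

```python
import numpy as np

def kernel_smoother(x0, x, y, lam):
    """Kernel-weighted average: sum_n K_lambda(x0, x_n) y_n / sum_n K_lambda(x0, x_n)."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)     # Gaussian D(|x_n - x0| / lam)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 100)
y = np.sin(4 * x) + 0.2 * rng.normal(size=100)
print([round(kernel_smoother(x0, x, y, lam=0.1), 3) for x0 in (0.2, 0.5, 0.8)])
```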
73 Nearest neighbors (NN) for regression In the past we discussed nearest neighbors (NN) for classification using majority/plurality votes. It can also be used for regression, e.g., by performing a locally-linear fit.
74 Support Vector Machines [Hastie et al., Sections ] Keywords: separating hyperplane, support vector machines
75 From linear to nonlinear classifiers Main idea: find the hyperplane that best separates two classes of data. Can seldom separate all the data → allow errors via slack variables. Select points locally (with a kernel?). Nonlinear mapping to a higher dimension.
76 Simple case How do we compute a perfectly separating hyperplane? Hyperplane = {X: f(X) = X^T β + β_0 = 0}, β_0 ∈ R, β ∈ R^P. Classifier: G(X) = sign{X^T β + β_0}. Perfect classification: Y_n f(X_n) > 0 for all n. The best hyperplane maximizes the margin M: max_{β, β_0, ‖β‖=1} M subject to Y_n (X_n^T β + β_0) ≥ M for all n. Rewrite: min_{β, β_0} ‖β‖ subject to Y_n (X_n^T β + β_0) ≥ 1 for all n.
77 Slack variables A perfectly separating hyperplane may not exist. Require Y_n (X_n^T β + β_0) ≥ M(1 − ξ_n) for all n. The slack variables {ξ_n} should be ≥ 0 and have small ℓ1 norm. Details in the book.
78 SVM + non-linear mappings A perfectly separating hyperplane may exist in a high-dimensional space. Map R^P to a high-dimensional space (kernels, splines, polynomials) and find the hyperplane with the best margin in the transformed space.
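For completeness, a hedged scikit-learn sketch rather than an implementation of the optimization above: a linear SVM with slack versus an RBF-kernel SVM on toy data that no straight line can separate. The hyperparameters C and gamma are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two classes that a straight line cannot separate (inner vs. outer points)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# Linear SVM with slack (C controls how much slack is tolerated)
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Nonlinear mapping via an RBF kernel: a hyperplane in the transformed space
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

print(linear_svm.score(X, y), rbf_svm.score(X, y))   # the RBF kernel should separate much better
```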
More information