ECE 592 Topics in Data Science
1 ECE 592 Topics in Data Science Dror Baron Associate Professor Dept. of Electrical and Computer Engr. North Carolina State University, NC, USA
2 Two Classification Examples [Hastie et al., Chapter ] Keywords: classification, least squares, linear models, nearest neighbors
3 Classification problem Simple binary classification problem (code online). Training samples in 2-dimensional space (200 samples). Each sample labeled blue/red (the book uses green/red). Not (yet!) revealing how the samples were generated. Goal: for test data, predict the class (blue/red).
4 Is the goal realistic? Structure of the classes not clear; some red/blue samples almost overlap. We want two conflicting goals: a stable classification method (doesn't jump around a lot) and a low-bias method (usually accurate).
5 Nearest neighbors approach For each test point, find the K nearest neighbors among the training samples. The classes (blue/red) of the K nearest neighbors are known; classify using a majority vote. How big should K be?
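As a rough illustration of the majority-vote rule above, here is a minimal numpy sketch; the course code is online, so this is only an assumed stand-in, and the toy data and labeling rule are placeholders rather than the dataset from the slides.

```python
import numpy as np

def knn_classify(X_train, y_train, x_test, K):
    """Classify one test point by majority vote among its K nearest training samples."""
    # Euclidean distances from the test point to every training sample
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the K closest training samples
    nearest = np.argsort(dists)[:K]
    # Majority vote over their labels (labels assumed coded blue=0, red=1)
    votes = y_train[nearest]
    return int(np.round(votes.mean()))

# Toy usage with made-up 2-D data (200 samples, as in the slides)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # placeholder labeling rule
print(knn_classify(X_train, y_train, np.array([0.5, 0.5]), K=5))
```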
6 Nearest neighbors results (K=50) Begin with large K=50 (recall 200 training samples). Each test sample is classified using lots of neighbors. Very stable (large areas of red and blue in the classifier output). Also heavily biased (fails to adapt to small patches of the other color).
7 Nearest neighbors results (K=5) Smaller K=5. Less stable (areas of blue and red change quickly). Seems to classify more precisely. Maybe we should use the minimal K=1?
8 Nearest neighbor (K=1) Unstable (often have singleton samples). Classifies the training data well, but maybe prone to noise? A moderate K offers the best stability/bias trade-off.
9 How can we choose K automatically? The optimal value of K depends on several factors. More randomness → need a greater K for a confident majority vote. High-dimensional data affects neighborliness (known as the curse of dimensionality). Large N → lots of computation to find nearest neighbors. Can select a good K with a validation approach, as sketched below: partition into training/test data, then validate quality on the test data after training with different K.
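The validation idea might look like the sketch below (assuming the knn_classify helper from the earlier sketch); the split fraction and candidate K values are arbitrary choices, not prescriptions from the slides.

```python
import numpy as np

def choose_K_by_validation(X, y, candidate_Ks, val_fraction=0.3, seed=0):
    """Hold out a validation split, score each candidate K, return the best one."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    best_K, best_acc = None, -1.0
    for K in candidate_Ks:
        preds = [knn_classify(X[train_idx], y[train_idx], x, K) for x in X[val_idx]]
        acc = np.mean(np.array(preds) == y[val_idx])
        if acc > best_acc:
            best_K, best_acc = K, acc
    return best_K

# Example: search over a few values of K
# print(choose_K_by_validation(X_train, y_train, candidate_Ks=[1, 5, 15, 50]))
```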
10 Linear model Consider a model for the data: Ŷ = β_0 + Σ_{j=1}^p X_j β_j, with # dimensions p=2 in our example. X_j is the coordinate along dimension j, β_j is the estimated linear coefficient, and the first coefficient β_0 corresponds to the intercept or bias. Why is the linear model commonly used? Relatively simple (tractable); often works well; many systems are well-approximated as linear → lots of machinery developed for this.
11 How do we compute coefficients? Let's simplify notation: Ŷ = X^T β; interpret it as the linear function f(X) = X^T β. Least squares method [Gauss, 1795]: minimize the sum of squared errors Error(β) = Σ_{n=1}^N (y_n − x_n^T β)². In vector form, Error(β) = (y − Xβ)^T (y − Xβ), where X is an N×p matrix and y is the N-vector of training classes. To minimize Error(β), take the derivative with respect to (w.r.t.) β: X^T (y − Xβ) = 0 (zero derivative), so X^T y = X^T X β. Solution (pseudo-inverse): β̂ = (X^T X)^{-1} X^T y.
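A small numpy sketch of this solution with made-up data: np.linalg.lstsq solves the same least-squares problem, and the explicit (X^T X)^{-1} X^T y formula from the slide is computed only as a sanity check.

```python
import numpy as np

def least_squares(X, y):
    """Least-squares fit; lstsq is the numerically safer route to (X^T X)^{-1} X^T y."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Toy check against the slide's explicit pseudo-inverse formula
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X = np.column_stack([np.ones(200), X])          # prepend a column of ones for the intercept beta_0
y = X @ np.array([0.5, 1.0, -2.0]) + 0.1 * rng.normal(size=200)
beta_hat = least_squares(X, y)
beta_pinv = np.linalg.inv(X.T @ X) @ X.T @ y    # (X^T X)^{-1} X^T y
print(np.allclose(beta_hat, beta_pinv))
```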
12 Linear classifier results We have two classification regions, red/blue. A straight line separates them. Very stable; terrible bias.
13 Discussion Linear model: very stable and biased. Nearest neighbors: less stable and less biased; needs a moderate value of K. Could interpret nearest neighbors as better, but it suffers from the curse of dimensionality (scarcity of neighbors in high dimensions).
14 Decision Theory [Hastie et al., Section 2.4] Keywords: decision theory
15 Decision theory More general problem: minimize a loss function. Squared-error loss: we've seen Error(β) = Σ_{n=1}^N (y_n − x_n^T β)², which emphasizes/accentuates larger errors; in the Bayesian case (known statistics) the optimal estimator is the conditional expectation Ŷ = E[Y|X]. Absolute-error loss: in our example, Error(β) = Σ_{n=1}^N |y_n − x_n^T β|, so smaller errors receive relatively more weight; leads to the median estimator, typically viewed as more robust. Bayesian binary classification selects the most probable class.
16 Example Consider Y = {2, 1, 1.5, 2, 17}. We will show for Error(β) = ‖Y − β‖₂² that the average, β = 4.7, minimizes the error (details in supplements). We will show for Error(β) = ‖Y − β‖₁ that the median, β = 2, minimizes the error (details in supplements).
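A quick numerical check of this claim (not the supplement's derivation): evaluate both losses over a grid of candidate β and confirm the minimizers are the mean and the median.

```python
import numpy as np

Y = np.array([2.0, 1.0, 1.5, 2.0, 17.0])
betas = np.linspace(0, 20, 2001)                       # candidate constant estimates
sq_err = [np.sum((Y - b) ** 2) for b in betas]         # squared-error loss
abs_err = [np.sum(np.abs(Y - b)) for b in betas]       # absolute-error loss

print(betas[np.argmin(sq_err)], Y.mean())              # both approximately 4.7
print(betas[np.argmin(abs_err)], np.median(Y))         # both 2.0
```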
17 Clustering (Related to Project 1) [Hastie et al., Section 14.3] Keywords: clustering, unsupervised learning
18 Goals Want to group data into clusters that seem related. Central notion: degree of similarity between different clusters. The typical algorithmic approach is iterative: move points between clusters, then recalculate the cluster centers.
19 K means algorithm Initialize K cluster centers: select K points among the training data. Iterate until convergence: associate each training datum with the nearest cluster center, then recompute the cluster centers as the average of the training data in each cluster. Sensitive to initialization (can get stuck in a local optimum). Other clustering algorithms use a model for each cluster.
20 More about K means Map datum x_n in cluster C(n)=k to representation level r_k, k = k(n) = C(n). Squared error between x_n and r_k: d(x_n, r_k) = Σ_{p=1}^P (x_{np} − r_{kp})² = ‖x_n − r_k‖². Want r_k = argmin_{r ∈ R^P} Σ_{n: C(n)=k} ‖x_n − r‖²; the minimizer is the cluster average, r_k = (1/|{n: C(n)=k}|) Σ_{n: C(n)=k} x_n. Summed squared error for mapping C: Error(C) = Σ_{n=1}^N ‖x_n − r_{k(n)}‖² (sum over N data) = Σ_{k=1}^K Σ_{n: C(n)=k} ‖x_n − r_k‖² (sum over K clusters).
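A minimal K-means sketch following the two steps above; the initialization (K random training points) and the convergence test are simple placeholder choices.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate assignment to the nearest center and center recomputation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # initialize with K training points
    for _ in range(n_iters):
        # Assignment step: nearest center for every datum
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster (keep old center if empty)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# labels, centers = kmeans(data, K=3)   # data: any (N, p) array
```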
21 Linear Regression [Hastie et al., Chapter 3] Keywords: linear models, regression
22 Recall linear model Linear model for the data: Ŷ = β_0 + Σ_{j=1}^p X_j β_j. The linear model is tractable, often works well, and many systems are approximately linear. Favorable feedback loop between math/algorithms for linear models and hardware with linear properties.
23 Meaning of X? In Ŷ = β_0 + Σ_{j=1}^p X_j β_j, the variables X_j could be: 1. actual inputs (example: data about patients); 2. functions of inputs (example: log, square root); 3. polynomials of inputs (Taylor series approximation); 4. codes of classes (example: red=1, green=2, blue=3); 5. interactions between variables (example: X_3 = X_1·X_2). The function f(X) = β_0 + Σ_{j=1}^p X_j β_j could therefore be nonlinear in the underlying factors.
24 Classical solution Least squares method [Gauss, 1795]: minimize the sum of squared errors Error(β) = Σ_{n=1}^N (y_n − x_n^T β)². Closed-form solution: β̂ = (X^T X)^{-1} X^T y, with pseudo-inverse X^+ = (X^T X)^{-1} X^T. Geometric interpretation: interpret the inputs as a matrix X = [1, X_1, X_2, ..., X_p]; consider the vector space spanned by the columns of X; project Y onto that space, forming Ŷ = X β̂ = X X^+ Y; the residual Y − Ŷ is orthogonal to the space.
25 Real world example Consider financial prediction. Y_n = future % change in the price of stock n ∈ {1,...,N}. X_n1 = earnings / price (high is good; want high earnings/profits). X_n2 = book / price (book value = value of the company on paper; high is good). X_n3 = % price change over the last 6 months (price momentum: stocks that went up keep going up, on average). Can have (many) more variables X. Goal: predict future returns (% price change) from the data.
26 Example continued How to predict future returns? Form a linear model for the returns: Ŷ_n = β_0 + Σ_{j=1}^p X_{nj} β_j. Compute the parameters β (can use least squares). Apply the linear model with these parameters to predict (future) returns.
27 Multivariate and Univariate Regression [Hastie et al., Chapter 3.3] Keywords: multivariate linear regression, orthogonalization, univariate linear regression
28 Start with univariate Univariate linear model, Y = Xβ + ε; keep it simple, no intercept. Will show β̂ = (Σ_{n=1}^N x_n y_n) / (Σ_{n=1}^N x_n²) (details in supplements). To do so, use vector notation: x = vector of the x_n, y = vector of the y_n, inner product <x,y> = Σ_{n=1}^N x_n y_n, so β̂ = <x,y>/<x,x>. Can compute the residual (error), r = y − x β̂.
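A tiny sketch of the inner-product formula β̂ = <x,y>/<x,x> and the residual, with made-up numbers; the residual comes out (numerically) orthogonal to x.

```python
import numpy as np

def univariate_fit(x, y):
    """No-intercept univariate least squares: beta = <x,y> / <x,x>."""
    beta = np.dot(x, y) / np.dot(x, x)
    residual = y - x * beta
    return beta, residual

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta, r = univariate_fit(x, y)
print(beta, np.dot(x, r))   # <x, r> is (numerically) zero
```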
29 From univariate to multivariate Suppose the columns x_1, ..., x_p are orthogonal; orthogonal means <x_j, x_k> = 0 for j ≠ k. Can show β̂_j = <x_j, y>/<x_j, x_j> (for the orthogonal case). Challenge: real data isn't orthogonal. Solution: orthogonalize it (e.g., the Gram-Schmidt procedure), as sketched below. But orthogonalization can be numerically sensitive.
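One possible sketch of the orthogonalization idea, using classical Gram-Schmidt (the numerically sensitive variant mentioned above). Only the coefficient on the last orthogonalized column equals the corresponding multiple-regression coefficient; the other coefficients are with respect to the orthogonalized columns, not the original ones.

```python
import numpy as np

def regress_by_orthogonalization(X, y):
    """Gram-Schmidt the columns of X, then fit each coefficient as <z_j, y>/<z_j, z_j>."""
    N, p = X.shape
    Z = np.zeros((N, p))
    for j in range(p):
        z = X[:, j].astype(float)
        for k in range(j):
            # Subtract the projection of column j onto each previously orthogonalized column
            z = z - (np.dot(Z[:, k], X[:, j]) / np.dot(Z[:, k], Z[:, k])) * Z[:, k]
        Z[:, j] = z
    # Univariate fits on the orthogonalized columns
    gammas = np.array([np.dot(Z[:, j], y) / np.dot(Z[:, j], Z[:, j]) for j in range(p)])
    return Z, gammas
```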
30 Subset Selection [Hastie et al., Chapter 3.4] Keywords: feature selection, subset selection
31 Some problems with least squares Least squares often outputs an entirely nonzero vector. A dense (non-sparse) output corresponds to a complicated linear model; it is more convenient/insightful for a human operator to interpret a simpler model (humans are comfortable with 5-10 variables, not 500), so a slight loss of accuracy with a sparse output may be preferred. Least squares also doesn't use possible knowledge about the structure of β. Example structures: sparse β, discrete-valued β, non-negative β, ... Least squares (multiplication by the pseudo-inverse) ignores these.
32 Example where least squares fails Consider N users, each with a phone. Most users are inactive; K << N active users transmit {-1,+1}. The communication channel is often modeled as linear: Y = HX + noise, with X comprised of {-1, 0 (inactive), +1}. Least squares multiplies Y by the pseudo-inverse H^+, and H^+ Y contains elements not in {-1,0,+1}; but we want the output to be in {-1,0,+1}.
33 Subset selection Want sparse β → select a subset of the columns of the matrix X that approximates Y well. Called subset selection. Requires combinatorial complexity in general. Various techniques provide faster runtime; they may approach the best solution but not find it. Will study fast algorithms for this, but they require the elements of X to be independent and identically distributed (i.i.d.).
34 Another real world example Medical setting. Columns of X are measurements relating to patients; examples: height, blood pressure, heart problems in the family, ... Y is the medical outcome we're trying to predict; examples: blood pressure, sugar level, future treatment costs, ... Want to understand relations between conditions, possibly saving $. Lots of columns (dozens, hundreds, ...) make it challenging for a physician to comprehend the linear model; a subset-selection model based on a few terms is simpler for a human to work with.
35 Shrinkage [Hastie et al., Chapter 3.4.3] Keywords: LASSO, ridge regression, shrinkage
36 Shrinkage and ridge regression Subset selection throws out some columns entirely → may have large errors. Shrinkage methods modify β more subtly: they shrink elements of β toward zero. Ridge regression penalizes large β: β̂_ridge = argmin_β ‖y − Xβ‖² + λ‖β‖². Can be implemented with a modified pseudo-inverse, β̂_ridge = (X^T X + λI)^{-1} X^T y, where I is the identity matrix; the new penalty term is an ℓ2 norm.
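A short sketch of the modified pseudo-inverse above, with toy data to show the coefficients shrinking toward zero as λ grows.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution beta = (X^T X + lam*I)^{-1} X^T y (the slide's modified pseudo-inverse)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# As lam grows, the coefficients shrink toward zero
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -1.0, 0.0, 2.0]) + 0.1 * rng.normal(size=100)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 3))
```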
37 LASSO Least absolute shrinkage and selection operator (LASSO): β̂_lasso = argmin_β ‖y − Xβ‖² s.t. ‖β‖₁ ≤ t. Can express with a Lagrange multiplier λ: argmin_β ‖y − Xβ‖² + λ‖β‖₁. Small t drives many coefficients to zero. Various fast algorithms implement the LASSO: gradient projection for sparse reconstruction (GPSR) [Nowak et al.], approximate message passing (AMP) [Donoho et al.].
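The sketch below is not GPSR or AMP; it is a plain proximal-gradient (iterative soft-thresholding) pass at the Lagrangian form argmin_β ‖y − Xβ‖² + λ‖β‖₁, included only to make the shrinkage/soft-thresholding behavior concrete.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2          # squared spectral norm; the gradient is 2L-Lipschitz
    step = 1.0 / (2 * L)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ beta - y)    # gradient of the squared-error term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```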
38 Example: Estimating Sparse Inputs [Sarvotham, B, & Baraniuk, Sudocodes - Fast Measurement and Reconstruction of Sparse Signals, Int. Symp. Info. Theory, 2006] Keywords: sudocodes
39 How can we estimate sparse β? The LASSO can drive many coefficients to zero; is that reasonable? Let's revisit the linear problem, Y = Xβ, and assume: sparse β (mostly zeros; will justify later in the course), real-valued β, no noise. Linear algebra 101 says X must contain at least as many observations as unknowns. Incorrect when β contains structure (e.g., sparsity)!
40 Example [figure: linear system Y = Xβ with the entries of β marked '?']
41 Example What does a zero measurement imply? Hint: sparse and real-valued β [figure: Y = Xβ, entries of β still marked '?']
42 Example Graph reduction! [figure: β with some entries now shown as 0, others still '?']
43 Example What do matching measurements imply? Hint: non-zeros in x are real numbers [figure: β with some entries 0, others still '?']
44 Example What is the last entry of x? [figure: one remaining '?' entry]
45 And more Resembles solving Sudoku puzzles, BUT: sparse β, no noise, binary matrix X. Can extend to noisy data, for example by evaluating almost-matching measurements; details in [Ma, B, & Needell, Two-Part Reconstruction with Noisy-Sudocodes, IEEE Trans. Signal Proc., 2014]. Will study more powerful algorithms using sparse signal processing.
46 Linear Methods for Classification [Hastie et al., Chapter 4] Keywords: decision boundaries, linear discriminant analysis, logistic regression
47 Classification problem Data X has a class label G ∈ {1,...,K}; want to predict G(X). Main idea: create affine functions f_k(X) = β_{k0} + β_k^T X, k ∈ {1,...,K} (affine = linear but need not go through the origin; note that β depends on k). Predict Ĝ(X) = argmax_{k ∈ {1,...,K}} f_k(X). Decision boundary between classes k and l: f_k(X) = f_l(X).
48 Illustration [Hastie et al. Fig. 4.1] Data from K=3 classes illustrated by colored numbers. Linear boundaries: computed by linear discriminant analysis (LDA); LDA details later. Quadratic boundaries: add cross terms (X_1 X_2, X_1², X_2²) to the space, then run LDA.
49 Linear regression of indicator matrix Define K indicator variables: Y_k(X) = 1 if G(X) = k, 0 else, for k ∈ {1,...,K}. Form the K variables into a vector, Y = (Y_1,...,Y_K); the N training instances form an N×K indicator matrix. Train linear regression on the indicator matrix: Ŷ = X X^+ Y, with pseudo-inverse X^+ = (X^T X)^{-1} X^T. The first column of X contains ones (bias).
50 How is this useful? Classify: Ĝ(X) = argmax_{k ∈ {1,...,K}} f_k(X), where each f_k(X) comes from the linear regression. Interpretation: f_k(X) resembles Pr(G(X)=k | X), because the expectation of an indicator is a probability. In general, as N increases we can add more basis expansions (e.g., polynomial terms) and run the regression on those.
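A compact sketch of the indicator-matrix recipe: build the N×K indicator matrix, regress it on X with a leading column of ones, then classify by argmax. Class labels are assumed coded 0,...,K−1 here, which differs from the 1,...,K convention on the slides.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Regress an N x K indicator matrix on X (with a column of ones prepended)."""
    N = len(g)
    Xb = np.column_stack([np.ones(N), X])                  # first column of ones (bias)
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                               # indicator of class membership
    B = np.linalg.lstsq(Xb, Y, rcond=None)[0]              # coefficients B = X^+ Y
    return B

def predict_indicator_regression(B, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    scores = Xb @ B                                        # f_k(X) for every class k
    return scores.argmax(axis=1)                           # G-hat(X) = argmax_k f_k(X)
```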
51 Limitation Classes can be masked by others. Here we have K=3 classes, but only 2 are recognized. In general, an order-(K−1) polynomial corrects this.
52 Linear Discriminant Analysis [Hastie et al., Section 4.3] Keywords: linear discriminant analysis, quadratic discriminant analysis
53 Bayesian approach to classification Consider the class-conditional densities f_k(X) = f(X|k). Class k has prior probability π_k ≥ 0 (Σ_k π_k = 1). Bayes' theorem: Pr(k|X) = π_k f(X|k) / Σ_{k'=1}^K π_{k'} f(X|k') = π_k f_k(X) / Σ_{k'=1}^K π_{k'} f_{k'}(X) = Pr(X,k) / Σ_{k'=1}^K Pr(X,k'). Need a statistical model for the class-conditional densities.
54 Example (Question 3, practice midterm 2016) Two classes with pdfs f_blue = 0.5·N(−1,1) + 0.5·N(2,1) and f_red = N(0,1); Pr(blue) = Pr(red) = 0.5. The means of the Gaussian components differ; the variances of all Gaussian components are 1; for the blue class, 0.5 is the probability of each Gaussian component. Let's derive the Bayesian classification approach. Will be worked out; detailed solution in the supplement.
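A numerical sketch of the Bayes rule for this example (the analytical derivation is in the supplement): evaluate both class-conditional densities on a grid, form the posterior for blue, and read off approximately where it crosses 1/2.

```python
import numpy as np

def normal_pdf(x, mu, var=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.linspace(-5, 6, 1101)
f_blue = 0.5 * normal_pdf(x, -1.0) + 0.5 * normal_pdf(x, 2.0)   # mixture class-conditional
f_red = normal_pdf(x, 0.0)
post_blue = 0.5 * f_blue / (0.5 * f_blue + 0.5 * f_red)          # Bayes: Pr(blue | x)

decide_blue = post_blue > 0.5                                    # most probable class at each x
crossings = np.flatnonzero(np.diff(decide_blue.astype(int)) != 0)
print(x[crossings])                                              # approximate decision thresholds
```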
55 What density model to use? IF we know the true f_k(X) = f(X|k), we're done; a BIG if. Possible models: Gaussian densities (linear discriminant analysis); Gaussian mixtures (nonlinear decision boundaries); nonparametric densities; naïve Bayes (separable pdf).
56 Multivariate Gaussian density Choose the multivariate Gaussian density f_k(X) = (2π)^{−p/2} |Σ_k|^{−1/2} exp(−½ (X − μ_k)^T Σ_k^{−1} (X − μ_k)), with cluster mean μ_k and covariance matrix Σ_k; entry ij of Σ_k captures the covariance between elements i and j. Assume the same covariance matrix for all classes, Σ_k = Σ. Will revisit this assumption soon.
57 Interpretation Each class has a Gaussian centered around μ_k. Elliptical level sets (contours) for each Gaussian component; the structure of the ellipse depends on Σ.
58 Linear discriminant analysis (LDA) Can show log[Pr(G=k|X)/Pr(G=l|X)] = log(π_k/π_l) − ½ (μ_k + μ_l)^T Σ^{−1} (μ_k − μ_l) + x^T Σ^{−1} (μ_k − μ_l), which is linear in x → linear decision boundary. Classification: linear discriminant function δ_k(X) = log π_k + x^T Σ^{−1} μ_k − ½ μ_k^T Σ^{−1} μ_k; choose the k that maximizes δ_k(X). In practice, we will learn Σ, μ_k, π_k. Closely related to the linear regression approach.
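A minimal sketch of LDA as described above: estimate π_k, μ_k, and a pooled covariance Σ from labeled data, then classify by maximizing δ_k(x). Labels are assumed coded 0,...,K−1.

```python
import numpy as np

def fit_lda(X, g, K):
    """Estimate priors, class means, and a pooled covariance; return a discriminant classifier."""
    N, p = X.shape
    pis = np.array([np.mean(g == k) for k in range(K)])
    mus = np.array([X[g == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((p, p))
    for k in range(K):                                       # pooled within-class covariance
        D = X[g == k] - mus[k]
        Sigma += D.T @ D
    Sigma /= (N - K)
    Sinv = np.linalg.inv(Sigma)

    def delta(x):
        # delta_k(x) = log pi_k + x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k
        return np.array([np.log(pis[k]) + x @ Sinv @ mus[k] - 0.5 * mus[k] @ Sinv @ mus[k]
                         for k in range(K)])

    return lambda x: int(np.argmax(delta(x)))
```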
59 Quadratic discriminant analysis (QDA) What if the covariance matrices Σ_k are different? Each class has a different elliptical level set, and the decision boundaries become quadratic. QDA requires more parameters than LDA → needs a somewhat larger N to work well. In general, LDA and QDA have good performance: data seldom support complicated decision rules, and LDA and QDA yield robust/stable boundaries.
60 Logistic Regression [Hastie et al., Section 4.4] Keywords: logistic regression
61 Motivation Want to model the posteriors linearly: log[Pr(G=l|X)/Pr(G=K|X)] = β_{l0} + β_l^T X, for classes l = 1,...,K−1. Can show: Pr(G=l|X) = exp(β_{l0} + β_l^T X) / (1 + Σ_{m=1}^{K−1} exp(β_{m0} + β_m^T X)) for l = 1,...,K−1, and Pr(G=K|X) = 1 / (1 + Σ_{m=1}^{K−1} exp(β_{m0} + β_m^T X)).
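A small sketch that just evaluates these posterior formulas for given (assumed, made-up) parameters; it does not fit the parameters, which is the hard part discussed next.

```python
import numpy as np

def logistic_posteriors(x, B0, B):
    """Posteriors from the slide's formulas.
    B0: length K-1 intercepts beta_{l0}; B: (K-1, p) slopes beta_l; class K is the reference."""
    scores = np.exp(B0 + B @ x)                       # exp(beta_{l0} + beta_l^T x), l = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)     # last entry is Pr(G = K | x)

# Toy usage: K = 3 classes, p = 2 features, made-up parameters (sums to 1)
print(logistic_posteriors(np.array([0.5, -1.0]),
                          B0=np.array([0.1, -0.2]),
                          B=np.array([[1.0, 0.0], [0.0, 1.0]])))
```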
62 What's the hard part? The hard part is learning the parameters. The book describes some details; left for a possible final project.
63 Basis Expansions [Hastie et al., Section ] Keywords: basis expansion, regularization, splines, Taylor approximation
64 Why linear models? So far we have discussed linear models quite a bit. Linear = first-order Taylor approximation. Hard to fit more complex functions when N is small or P is large. But real-world functions/data aren't linear.
65 Augmented models Begin with the vector of inputs (X_1,...,X_P). Augment/replace the inputs with transformations of X; each transformation is h_m(X): R^P → R. Linear basis expansion: f(X) = Σ_{m=1}^M β_m h_m(X). Will fit the parameters β_m.
66 Example basis expansions 1. h_m(X) = X_m: the original linear model. 2. h_m(X) = (X_j)² or X_j X_k: second (or higher) order Taylor; beware growth in the number of terms (need to regularize). 3. Piecewise polynomials/splines: control complexity using variable selection or regularization.
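A one-dimensional polynomial-basis sketch of the idea: build h_m(x) = x^m, then fit the β_m by least squares on a toy nonlinear target. The degree and data are arbitrary placeholders.

```python
import numpy as np

def polynomial_basis(x, degree):
    """h_m(x) = x^m for m = 0..degree (a simple one-dimensional basis expansion)."""
    return np.column_stack([x ** m for m in range(degree + 1)])

# Fit f(x) = sum_m beta_m h_m(x) by least squares on a noisy nonlinear target
rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 100)
y = np.sin(2 * x) + 0.1 * rng.normal(size=100)
H = polynomial_basis(x, degree=5)
beta = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.round(beta, 3))
```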
67 Why piecewise polynomials? A global fit fluctuates wildly; piecewise fits capture local structure. Enforce continuity; the constraints reduce the degrees of freedom.
68 Common approach: Cubic splines Fit well within each region, but fluctuate even more elsewhere. Also have multi-dimensional splines, wavelets.
69 Real world application Recall the financial example: X contains financial data (prices, firm profits, sales, ...). A linear model for future prices may be restrictive; augmenting X with basis functions possibly gives a better fit.
70 Kernel Methods [Hastie et al., Section ] Keywords: kernel method
71 Recall nearest neighbors f̂(X) = Average(Y_i | X_i ∈ N_K(X)), where N_K(X) is the set of the K nearest neighbors. Yellow rectangle: its height represents the weights (1/K). As points leave/enter N_K(X), the NN estimator is bumpy.
72 Key idea Maybe give different weights to the X_i ∈ N(X)? Assign greater confidence (weight) to nearer points, tapering away for distant points: f̂(X_0) = Σ_{n=1}^N K_λ(X_0, X_n) Y_n / Σ_{n=1}^N K_λ(X_0, X_n), with kernel K_λ(X_0, X) = D(|X − X_0| / λ). The width parameter λ can be chosen automatically.
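A sketch of this weighted average with a Gaussian choice of D (the slides do not fix the kernel shape, so that is an assumption), on made-up one-dimensional data.

```python
import numpy as np

def kernel_smoother(x0, x, y, lam):
    """Kernel-weighted average: sum_n K_lambda(x0, x_n) y_n / sum_n K_lambda(x0, x_n)."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)     # Gaussian D(|x_n - x0| / lam)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 100)
y = np.sin(4 * x) + 0.2 * rng.normal(size=100)
print([round(kernel_smoother(x0, x, y, lam=0.1), 3) for x0 in (0.2, 0.5, 0.8)])
```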
73 Nearest neighbors (NN) for regression In the past we discussed nearest neighbors (NN) for classification using majority/plurality votes. It can also be used for regression, e.g., by performing a locally-linear fit.
74 Support Vector Machines [Hastie et al., Sections ] Keywords: separating hyperplane, support vector machines
75 From linear to nonlinear classifiers Main idea: find the hyperplane that best separates two classes of data. Can seldom separate all the data → allow errors via slack variables. Select points locally (with a kernel?). Nonlinear mapping to a higher dimension.
76 Simple case How do we compute a perfectly separating hyperplane? Hyperplane = {X: f(X) = X^T β + β_0 = 0}, β_0 ∈ R, β ∈ R^P. Classifier: G(X) = sign{X^T β + β_0}. Perfect classification: Y_n f(X_n) > 0 for all n. The best hyperplane maximizes the margin M: max_{β, β_0, ‖β‖=1} M subject to Y_n (X_n^T β + β_0) ≥ M for all n. Rewrite: min_{β, β_0} ‖β‖ subject to Y_n (X_n^T β + β_0) ≥ 1 for all n.
77 Slack variables A perfectly separating hyperplane may not exist. Require Y_n (X_n^T β + β_0) ≥ M(1 − ξ_n) for all n. The slack variables {ξ_n} should be ≥ 0 and have small ℓ1 norm. Details in the book.
78 SVM + non-linear mappings A perfectly separating hyperplane may exist in a high-dimensional space. Map R^P to a high-dimensional space (kernels, splines, polynomials) and find the hyperplane with the best margin in the transformed space.
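For completeness, a hedged scikit-learn sketch rather than an implementation of the optimization above: a linear SVM with slack versus an RBF-kernel SVM on toy data that no straight line can separate. The hyperparameters C and gamma are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two classes that a straight line cannot separate (inner vs. outer points)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# Linear SVM with slack (C controls how much slack is tolerated)
linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Nonlinear mapping via an RBF kernel: a hyperplane in the transformed space
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

print(linear_svm.score(X, y), rbf_svm.score(X, y))   # the RBF kernel should separate much better
```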
More information