BANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms

BANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms
Shaobo Li, University of Cincinnati
Partially based on Hastie et al. (2009) ESL and James et al. (2013) ISLR.

Overview
- Supervised Learning: Naive Bayes Classifier, Support Vector Machine, Neural Networks
- Unsupervised Learning: Principal Component Analysis

Naive Bayes Classifier
- Supervised learning, classification problem
- Simple yet powerful classifier
- Popular in text mining: document classification, spam filtering, face recognition, sentiment analysis

Bayes Theorem
Posterior probability:
$$P(Y = k \mid x) = \frac{P(x \mid Y = k)\, P(Y = k)}{P(x)}$$
- $P(Y = k)$ is the prior: the proportion of observations in class $k$
- $P(x \mid Y = k)$ is the likelihood of the features in class $k$
Since the denominator does not depend on $Y$,
$$P(Y = k \mid x) \propto P(x \mid Y = k)\, P(Y = k)$$

Naive Bayes
The Bayes classifier is based on Bayes' theorem:
$$C_B(x) = \arg\max_k P(Y = k \mid x) = \arg\max_k P(x \mid Y = k)\, P(Y = k)$$
"Naive assumption": all features (the $X_j$'s) are conditionally independent, that is,
$$P(x \mid Y = k) = \prod_{j=1}^{p} P(x_j \mid Y = k)$$
By taking the logarithm (a monotone transformation),
$$C_B(x) = \arg\max_k \left[ \sum_{j=1}^{p} \log P(x_j \mid Y = k) + \log P(Y = k) \right]$$
In most cases naive Bayes works well even when the independence assumption is violated.
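To make the arg-max rule concrete, here is a minimal NumPy sketch of the decision rule above; the prior and per-feature log-likelihood tables are hypothetical placeholders, not estimates from any lecture data.

```python
# Minimal sketch of the naive Bayes decision rule: choose the class maximizing
# the sum of per-feature log-likelihoods plus the log-prior.
import numpy as np

log_prior = np.log(np.array([0.6, 0.4]))      # log P(Y = k), k = 0, 1 (hypothetical)
# log P(x_j = value | Y = k): shape (n_classes, n_features, n_values), hypothetical
log_lik = np.log(np.array([
    [[0.7, 0.3], [0.9, 0.1]],                 # class 0
    [[0.2, 0.8], [0.5, 0.5]],                 # class 1
]))

def naive_bayes_predict(x):
    """x is a vector of discrete feature values (0/1 here)."""
    scores = log_prior + np.array([
        sum(log_lik[k, j, x[j]] for j in range(len(x)))
        for k in range(log_lik.shape[0])
    ])
    return int(np.argmax(scores))             # C_B(x)

print(naive_bayes_predict([1, 0]))            # predicted class index
```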

Gaussian Naive Bayes
When the features $X_j$ are continuous: given class $k$, for each feature we can estimate $\mu_k$ and $\sigma_k^2$. Then
$$P(x \mid Y = k) = \frac{1}{\sqrt{2\pi\hat\sigma_k^2}} \exp\left( -\frac{(x - \hat\mu_k)^2}{2\hat\sigma_k^2} \right)$$
Under the equal-variance assumption, it is equivalent to LDA.
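As a hedged illustration (not code from the course), scikit-learn's GaussianNB fits exactly these per-class, per-feature Gaussian estimates; the two-class synthetic data below are made up.

```python
# Gaussian naive Bayes on synthetic two-class data.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gnb = GaussianNB().fit(X_train, y_train)      # estimates mu_k, sigma_k^2 per feature
print("test accuracy:", gnb.score(X_test, y_test))
```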

Multinomial Naive Bayes
Typically used for text classification. Suppose $p$ is the number of features (e.g., the size of the vocabulary), and $x = (x_1, \ldots, x_p)$ are the counts of each feature. Let $\theta_{kj}$ be the proportion of feature $j$ in class $k$; then we have
$$P(x \mid Y = k) = \underbrace{\frac{\left(\sum_{j=1}^{p} x_j\right)!}{\prod_{j=1}^{p} x_j!}}_{\text{does not depend on } Y} \; \prod_{j=1}^{p} \theta_{kj}^{x_j}$$
Therefore, by taking the logarithm, the multinomial Bayes classifier is
$$C_B(x) = \arg\max_k \left[ \sum_{j=1}^{p} x_j \log \theta_{kj} + \log P(Y = k) \right]$$
This is a linear classifier.
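A toy sketch of multinomial naive Bayes for text, assuming scikit-learn's CountVectorizer and MultinomialNB; the documents and spam labels are invented for illustration.

```python
# Word counts x_j feed MultinomialNB, which estimates theta_kj and the priors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free prize click now", "meeting agenda attached",
        "win money free offer", "project status update"]
labels = [1, 0, 1, 0]                          # 1 = spam, 0 = not spam (made up)

vec = CountVectorizer()
X = vec.fit_transform(docs)                    # document-term count matrix
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free money now"])))   # likely [1] (spam)
```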

Support Vector Machine
- Developed in the computer science community
- Very popular in practice for classification
- Goal: find a hyperplane that separates the two classes
- How: the support vector machine

Hyperplane
A flat affine subspace of dimension $p - 1$:
- $p = 2$: with $X_1, X_2$, the hyperplane is a line
- $p = 3$: with $X_1, X_2, X_3$, the hyperplane is a plane
The hyperplane can be expressed as
$$f(x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p = \beta_0 + \beta^T x = 0$$
Usually we require $\|\beta\| = 1$ as a constraint, so that the distance of any point $x^*$ to the hyperplane is $|f(x^*)|$.
$f(x^*) > 0$: $x^*$ lies on one side; $f(x^*) < 0$: on the other side.
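A small numeric sketch (illustrative numbers only) of evaluating $f(x) = \beta_0 + \beta^T x$ with $\|\beta\| = 1$, so the value is the signed distance to the hyperplane:

```python
import numpy as np

beta = np.array([3.0, 4.0])
beta = beta / np.linalg.norm(beta)             # enforce ||beta|| = 1
beta_0 = -1.0

def f(x):
    return beta_0 + beta @ x                   # signed distance when ||beta|| = 1

x_star = np.array([2.0, 1.0])
print(f(x_star))                               # > 0: one side, < 0: the other
```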

An Illustration (figure)

Maximal Margin Classifier
- Code the response variable $Y$ as $\{-1, 1\}$
- Maximize the margin
- What if the data are not separable?

Support Vector Classifier
When a perfect separating hyperplane does not exist, we allow a few points to be misclassified. Defining slack variables $\epsilon_1, \ldots, \epsilon_n$, the optimization problem is
$$\max_{\beta_0, \beta, \epsilon} M \quad \text{subject to} \quad \|\beta\| = 1, \quad y_i(\beta_0 + \beta^T x_i) \geq M(1 - \epsilon_i), \quad \epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C,$$
where $C$ is a nonnegative tuning parameter. Classify $x^*$ based on the sign of $f(x^*)$.
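A hedged sketch of fitting a support vector classifier with scikit-learn's SVC on synthetic data. Note that sklearn parameterizes the slack penalty, so its C plays roughly the inverse role of the budget $C$ in the formulation above (large sklearn C corresponds to a narrow margin).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.repeat([-1, 1], 50)                     # response coded as {-1, 1}

svc = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", svc.n_support_.sum())
print("prediction for a new point:", svc.predict([[0.5, 0.5]]))
```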

Role of C and Support Vectors
- $C$ controls the width of the margin
- With a larger $C$ the margin is wider, so more support vectors lie inside it: lower variance but higher bias

Feature Expansion and Kernels
Nonlinear support vector classifiers, e.g., via a polynomial feature space:
- Enlarge the feature space by transformation
- e.g., the quadratic feature space $(X_1, X_2, X_1^2, X_2^2, X_1 X_2)$
A more elegant mapping is through a kernel:
- A kernel is a function that quantifies the similarity of two observations
- Linear kernel: $K(x_i, x_{i'}) = \langle x_i, x_{i'} \rangle$, the inner product
- Polynomial kernel: $K(x_i, x_{i'}) = (1 + \langle x_i, x_{i'} \rangle)^d$
- Radial kernel: $K(x_i, x_{i'}) = \exp(-\gamma \|x_i - x_{i'}\|^2)$
The linear support vector classifier can be represented as
$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle$$
Conveniently, $\hat\alpha_i$ is nonzero only for the support vectors.
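For concreteness, the three kernels can be evaluated directly in NumPy for a pair of observations (the vectors, the degree $d = 3$, and $\gamma = 0.5$ are arbitrary choices):

```python
import numpy as np

x_i, x_ip = np.array([1.0, 2.0]), np.array([0.5, -1.0])

linear = x_i @ x_ip                            # <x_i, x_i'>
poly   = (1 + x_i @ x_ip) ** 3                 # degree d = 3
radial = np.exp(-0.5 * np.sum((x_i - x_ip) ** 2))   # gamma = 0.5

print(linear, poly, radial)
```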

Support Vector Machine
More generally, with a kernel, the support vector machine is
$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i),$$
where $S$ is the set of indices of the support vectors.
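This representation can be checked numerically. The sketch below fits an RBF-kernel SVC in scikit-learn on synthetic data and reconstructs $f(x)$ from the stored support vectors (sklearn's dual_coef_ holds $y_i \alpha_i$); the library choice is an assumption, not something prescribed by the lecture.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # nonlinear boundary

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = np.array([0.3, -0.2])
K = np.exp(-gamma * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
f_manual = clf.dual_coef_[0] @ K + clf.intercept_[0]   # beta_0 + sum_{i in S} alpha_i K(x, x_i)
print(f_manual, clf.decision_function([x_new])[0])     # the two values should match
```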

Neural Networks
- Transformation of the input data
- Hidden layer (neurons)
- Black box, hard to interpret
One-hidden-layer "vanilla" neural network:
- Input data $x = (x_1, \ldots, x_p)$
- Transformation: $Z_m = \sigma(\alpha_{0m} + \alpha_m^T x)$, where the sigmoid function $\sigma(v) = 1/(1 + e^{-v})$ is usually chosen
- Output: $f_k(x) = g_k(T)$, where $T_k = \beta_{0k} + \beta_k^T z$
- Regression: $g_k(T) = T_k$
- Multiclass classification: $g_k(T) = e^{T_k} / \sum_{l=1}^{K} e^{T_l}$
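A forward-pass sketch of this one-hidden-layer network with random, untrained weights, just to show the shapes and the two transformations (the names and layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
p, M, K = 4, 5, 3                              # inputs, hidden units, classes

alpha_0, alpha = rng.normal(size=M), rng.normal(size=(M, p))
beta_0, beta = rng.normal(size=K), rng.normal(size=(K, M))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(x):
    z = sigmoid(alpha_0 + alpha @ x)           # Z_m = sigma(alpha_0m + alpha_m^T x)
    t = beta_0 + beta @ z                      # T_k = beta_0k + beta_k^T z
    return np.exp(t) / np.exp(t).sum()         # softmax g_k(T) for classification

print(forward(rng.normal(size=p)))             # class probabilities summing to 1
```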

Illustration (figure)

Principal Component Analysis
- Unsupervised learning
- PCA produces a low-dimensional representation (approximation) of the data: normalized linear combinations, with maximal variance, uncorrelated (orthogonal)
- Can also be used for data visualization
- Pre-processing of data, dimension reduction
An illustration (figure)

Computation
The first principal component of the sample can be expressed as
$$z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \ldots + \phi_{p1} x_{ip},$$
where the vector $(\phi_{11}, \ldots, \phi_{p1})$ is called the loadings, subject to the constraint $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.
We want to maximize the variance of the newly created variable:
$$\max_{\phi_{11}, \ldots, \phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{j1} x_{ij} \right)^2$$
We use the singular value decomposition to solve for $\phi$: $X = UDV^T$. The columns of $UD$ are called the principal components of $X$.
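A sketch of this SVD-based computation in NumPy on synthetic data: center $X$, take $X = UDV^T$, and read off the scores $UD$ and loadings $V$ (the mixing matrix used to generate the data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3)) @ np.array([[2.0, 0.3, 0.1],
                                         [0.3, 1.0, 0.2],
                                         [0.1, 0.2, 0.5]])
Xc = X - X.mean(axis=0)                        # center each column

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * d                                 # columns of UD: principal component scores z
loadings = Vt.T                                # columns of V: loading vectors phi

print(loadings[:, 0])                          # first loading vector (unit norm)
print(np.var(scores, axis=0))                  # variances decrease across components
```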