BANA 7046 Data Mining I
Lecture 6. Other Data Mining Algorithms
Shaobo Li, University of Cincinnati
Partially based on Hastie et al. (2009), ESL, and James et al. (2013), ISLR
Overview
Supervised Learning
- Naive Bayes Classifier
- Support Vector Machine
- Neural Networks
Unsupervised Learning
- Principal Component Analysis
Naive Bayes Classifier
Supervised learning, classification problem
Simple yet powerful classifier
Popular in
- Text mining
- Document classification
- Spam filtering
- Face recognition
- Sentiment analysis
Bayes Theorem
Posterior probability
P(Y = k | x) = P(x | Y = k) P(Y = k) / P(x)
- P(Y = k) is the prior: the proportion of observations in class k
- P(x | Y = k) is the likelihood of the features in class k
The denominator does not depend on Y, so
P(Y = k | x) ∝ P(x | Y = k) P(Y = k)
Naive Bayes
Bayes classifier based on Bayes theorem:
C_B(x) = argmax_k P(Y = k | x) = argmax_k P(x | Y = k) P(Y = k)
"Naive assumption": all features (the X_j's) are independent, that is,
P(x | Y = k) = Π_{j=1}^p P(x_j | Y = k)
By taking the logarithm (a monotone transformation),
C_B(x) = argmax_k [ Σ_{j=1}^p log P(x_j | Y = k) + log P(Y = k) ]
In practice, naive Bayes often works well even when the independence assumption is violated.
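A minimal sketch of the log-posterior decision rule above, with made-up per-class log-likelihoods and priors (all numbers are purely illustrative, not from the slides):

```python
import numpy as np

# log_lik[k, j] plays the role of log P(x_j | Y = k) for one observation x,
# and log_prior[k] is log P(Y = k); values are hypothetical.
log_lik = np.array([[-1.2, -0.7, -2.1],    # class 0
                    [-0.9, -1.5, -0.4]])   # class 1
log_prior = np.log(np.array([0.6, 0.4]))   # class proportions

# C_B(x) = argmax_k  sum_j log P(x_j | Y = k) + log P(Y = k)
log_post = log_lik.sum(axis=1) + log_prior
print("predicted class:", np.argmax(log_post))
```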
Gaussian Naive Bayes
When the features X_j are continuous
Given class k, for each feature we can estimate μ_k and σ²_k. Then,
P(x_j | Y = k) = (1 / √(2π σ̂²_k)) exp( −(x_j − μ̂_k)² / (2 σ̂²_k) )
Under the equal-variance assumption, it is equivalent to LDA.
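As a quick, hypothetical illustration (not part of the slides), Gaussian naive Bayes can be fit with scikit-learn, which estimates the per-class, per-feature means and variances used in the density above; the iris data is just an example choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Fit Gaussian naive Bayes: per class and per feature, estimate mean and variance.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("estimated class means:\n", nb.theta_)       # plays the role of mu_hat
print("test accuracy:", nb.score(X_test, y_test))
```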
Multinomial Naive Bayes
Typically used for text classification
Suppose p is the number of features, e.g., the size of the vocabulary, and x = (x_1, ..., x_p) are the counts of each feature.
Let θ_kj be the proportion of feature j in class k. Then
P(x | Y = k) = [ (Σ_{j=1}^p x_j)! / Π_{j=1}^p x_j! ] Π_{j=1}^p θ_kj^{x_j},
where the factorial term does not depend on Y. Therefore, by taking the logarithm, the multinomial Bayes classifier is
C_B(x) = argmax_k [ Σ_{j=1}^p x_j log θ_kj + log P(Y = k) ]
This is a linear classifier.
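A small, hypothetical spam-filtering sketch of the multinomial model, assuming scikit-learn's CountVectorizer and MultinomialNB; the toy documents and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus; x_j are word counts, labels: 1 = spam, 0 = not spam.
docs = ["cheap pills buy now", "meeting schedule for monday",
        "buy cheap watches now", "project meeting notes attached"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)                 # document-term count matrix
clf = MultinomialNB().fit(X, labels)        # estimates theta_kj and class priors

print(clf.predict(vec.transform(["cheap watches buy now"])))
```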
Support Vector Machine
Originated in the computer science community
Very popular in practice for classification
Goal: find a hyperplane that separates two classes
How: support vector machine
Hyperplane
A flat affine subspace of dimension p − 1
- p = 2: features X_1, X_2; the hyperplane is a line
- p = 3: features X_1, X_2, X_3; the hyperplane is a plane
The hyperplane can be expressed as
f(x) = β_0 + β_1 x_1 + ... + β_p x_p = β_0 + β^T x = 0
Usually we impose the constraint ||β|| = 1, so that the distance from any point x* to the hyperplane is |f(x*)|.
f(x*) > 0: x* lies on one side; f(x*) < 0: on the other side
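A tiny numeric check of this interpretation: with a unit-norm β (values chosen arbitrarily here), f(x*) gives the signed distance of a point from the hyperplane:

```python
import numpy as np

# Hypothetical hyperplane f(x) = beta0 + beta^T x in p = 2 dimensions,
# with ||beta|| = 1 so that |f(x)| is the distance to the hyperplane.
beta = np.array([0.6, 0.8])        # unit-norm coefficient vector
beta0 = -1.0

for x_star in (np.array([2.0, 1.0]), np.array([0.5, 0.2])):
    f = beta0 + beta @ x_star
    side = "one side" if f > 0 else "the other side"
    print(f"f(x*) = {f:+.2f} -> distance {abs(f):.2f}, lies on {side}")
```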
An Illustration
(figure not reproduced in this text version)
Maximal Margin Classifier
Code the response variable Y as {−1, 1}
Maximize the margin
What if the data are non-separable?
Support Vector Classifier
A perfect separating hyperplane does not exist, so we allow a few points to be misclassified.
Define slack variables ε_1, ..., ε_n. The optimization problem is
maximize_{β_0, β, ε} M
subject to ||β|| = 1,
y_i (β_0 + β^T x_i) ≥ M(1 − ε_i),
ε_i ≥ 0,  Σ_{i=1}^n ε_i ≤ C,
where C is a nonnegative tuning parameter.
Classify x* based on the sign of f(x*).
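For a concrete (hypothetical) fit, scikit-learn's SVC with a linear kernel implements a support vector classifier; note that its C argument is a cost penalty, so it moves in the opposite direction from the slack budget C in the formulation above:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Support vector classifier on simulated, slightly overlapping classes.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=1)

svc = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors per class:", svc.n_support_)
print("training accuracy:", svc.score(X, y))
```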
Role of C and Support Vectors
C controls the width of the margin.
A larger C gives a wider margin: more support vectors lie on or inside it, so the classifier has lower variance but higher bias.
Feature Expansion and Kernels
Nonlinear support vector classifier, e.g., in a polynomial space
- Enlarge the feature space by transformation
- e.g., quadratic feature space (X_1, X_2, X_1², X_2², X_1 X_2)
A more elegant mapping is through a kernel
- A kernel is a function that quantifies the similarity of two observations
- Linear kernel: K(x_i, x_i') = ⟨x_i, x_i'⟩, the inner product
- Polynomial kernel: K(x_i, x_i') = (1 + ⟨x_i, x_i'⟩)^d
- Radial kernel: K(x_i, x_i') = exp(−γ ||x_i − x_i'||²)
The linear support vector classifier can be represented as
f(x) = β_0 + Σ_{i=1}^n α_i ⟨x, x_i⟩
Conveniently, α̂_i is nonzero only for the support vectors.
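A short sketch of the kernel trick in practice, assuming scikit-learn's SVC with the radial and polynomial kernels on simulated, non-linearly separable data (all parameter values are arbitrary):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes: not linearly separable in the original features.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

svm_rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)             # exp(-gamma ||xi - xi'||^2)
svm_poly = SVC(kernel="poly", degree=3, coef0=1).fit(X, y)   # polynomial kernel, d = 3
print("radial kernel accuracy:", svm_rbf.score(X, y))
print("polynomial kernel accuracy:", svm_poly.score(X, y))
```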
Support Vector Machine
More generally, with a kernel, the support vector machine is
f(x) = β_0 + Σ_{i∈S} α_i K(x, x_i),
where S is the set of indices of the support vectors.
Neural Networks
Transformation of the input data
Hidden layer (neurons)
Black box, hard to interpret
One-hidden-layer "vanilla" neural network:
- Input data x = (x_1, ..., x_p)
- Transformation: Z_m = σ(α_0m + α_m^T x), where σ(v) = 1/(1 + e^{−v}), the sigmoid function, is the usual choice
- Output: f_k(x) = g_k(T), where T_k = β_0k + β_k^T z
- Regression: g_k(T) = T_k
- Multiclass classification: g_k(T) = e^{T_k} / Σ_{l=1}^K e^{T_l} (softmax)
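A bare-bones forward pass of this one-hidden-layer network in NumPy, with random (untrained) placeholder weights just to make the formulas concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
p, M, K = 4, 5, 3                        # inputs, hidden units, classes

x = rng.normal(size=p)                   # one observation
alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, p))
beta0, beta = rng.normal(size=K), rng.normal(size=(K, M))

z = 1 / (1 + np.exp(-(alpha0 + alpha @ x)))   # Z_m = sigma(alpha_0m + alpha_m^T x)
T = beta0 + beta @ z                          # T_k = beta_0k + beta_k^T z
g = np.exp(T) / np.exp(T).sum()               # softmax output for classification
print("class probabilities:", g.round(3))
```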
Illustration
(figure not reproduced in this text version)
Principal Component Analysis
Unsupervised learning
PCA produces a low-dimensional representation (approximation) of the data
- Normalized linear combinations
- Maximal variance
- Uncorrelated (orthogonal)
Can also be used for data visualization
Pre-processing data, dimensionality reduction
An illustration
Computation
The first principal component of the sample can be expressed as
z_i1 = φ_11 x_i1 + φ_21 x_i2 + ... + φ_p1 x_ip,
where the vector (φ_11, ..., φ_p1) is called the loadings, subject to the constraint Σ_{j=1}^p φ²_j1 = 1.
We want to maximize the variance of the newly created variable:
maximize_{φ_11,...,φ_p1}  (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φ_j1 x_ij )²
We use the singular value decomposition to solve for φ:
X = U D V^T
The columns of UD are called the principal components of X.
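A small NumPy sketch of this computation on a simulated data matrix (the data and its transformation are made up): after centering, the SVD gives the loading vectors as the columns of V and the principal component scores as the columns of UD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2, 0, 0], [0.5, 1, 0], [0, 0, 0.2]])
X = X - X.mean(axis=0)                    # center each column first

U, d, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T                           # columns are the phi vectors
scores = U * d                            # columns of UD: principal components z
print("first loading vector:", loadings[:, 0].round(3))
print("variance of first PC:", scores[:, 0].var(ddof=1).round(3))
```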