Linear Methods in Data Mining


Why Linear Methods? Linear methods are well understood, simple, and elegant; algorithms based on linear methods are widespread in data mining, computer vision, graphics, and pattern recognition; excellent general software is available for experimental work.

Software Available: the JAMA Java package, available free on the Internet; MATLAB, excellent but expensive; SCILAB (free); OCTAVE (free).

Least Squares: the Main Problem. Given observations of p variables in n experiments, a_1, ..., a_p ∈ R^n, and the vector of the outcomes of the experiments b ∈ R^n, determine α_0, α_1, ..., α_p to express the outcome b as

b = α_0 + Σ_{i=1}^{p} α_i a_i.

Data Sets as Tables. A series of n experiments measuring p variables and an outcome:

    a_1   ...  a_p   b
    a_11  ...  a_1p  b_1
    ...        ...   ...
    a_n1  ...  a_np  b_n

Least Squares in Matrix Form. Determine α_0, α_1, ..., α_p such that

(1  a_1  ...  a_p) (α_0, α_1, ..., α_p)^T = b.

This system consists of n equations in the p + 1 unknowns α_0, ..., α_p and is overdetermined.

Example. Data on the relationship between weight (in lbs) and blood triglycerides (tgl):

weight  tgl
151     120
163     144
180     142
196     167
205     180
219     190
240     197

Example (cont'd): [scatter plot "Triglycerides vs. Weight" of tgl against weight for the data above].

Quest: find α_0 and α_1 such that tgl = α_0 + α_1 · weight. Two unknowns and 7 equations!

α_0 + 151 α_1 = 120
α_0 + 163 α_1 = 144
α_0 + 180 α_1 = 142
α_0 + 196 α_1 = 167
α_0 + 205 α_1 = 180
α_0 + 219 α_1 = 190
α_0 + 240 α_1 = 197

In matrix form:

( 1  151 )            ( 120 )
( 1  163 )            ( 144 )
( 1  180 )  ( α_0 )   ( 142 )
( 1  196 )  ( α_1 ) = ( 167 )
( 1  205 )            ( 180 )
( 1  219 )            ( 190 )
( 1  240 )            ( 197 )

or Aα = b. Finding the parameters allows building a model and making predictions of tgl based on weight.
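
For concreteness, the system above can be set up and solved in MATLAB/Octave as follows; this is only a sketch, and the variable names weight, tgl and the use of the backslash operator are ours, not part of the slides:

weight = [151; 163; 180; 196; 205; 219; 240];
tgl    = [120; 144; 142; 167; 180; 190; 197];
A = [ones(size(weight)) weight];   % 7 x 2 design matrix (1, weight)
b = tgl;                           % outcome vector
alpha = A \ b;                     % least-squares solution via backslash
% alpha(1) is the intercept and alpha(2) the slope; they should match the
% fitted line reported a few slides below (tgl = 0.8800*weight - 7.3626).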

The Least Squares Method (LSM). LSM is used in data mining as a method of estimating the parameters of a model. The estimation process is known as regression. Several types of regression exist, depending on the nature of the assumed model of dependency.

Making the Best of Overdetermined Systems. If Ax = b has no solution, the next best thing is to find c ∈ R^n such that ||Ac - b||_2 ≤ ||Ax - b||_2 for every x ∈ R^n. Since Ax ∈ range(A) for any x ∈ R^n, this amounts to finding u such that Au is the vector of range(A) closest to b.

Let A ∈ R^{m×n} be a full-rank matrix (rank(A) = n) such that m > n. The symmetric square matrix A^T A ∈ R^{n×n} has the same rank n as the matrix A, so (A^T A)x = A^T b has a unique solution. A^T A is positive definite because x^T A^T A x = (Ax)^T (Ax) = ||Ax||_2^2 > 0 if x ≠ 0.

Theorem. Let A ∈ R^{m×n} be a full-rank matrix such that m > n and let b ∈ R^m. The system (A^T A)x = A^T b has a unique solution x, and Ax equals the projection of the vector b on the subspace range(A). The system (A^T A)x = A^T b is known as the system of normal equations of A and b.

Example in MATLAB:

C = A' * A
x = C \ (A' * b)

C = ( 7     1354   )      x = ( -7.3626 )
    ( 1354  267772 )          (  0.8800 )

Fitted line: tgl = 0.8800 · weight - 7.3626.

Danger: Choosing the Simplest Way! The condition number of A^T A is the square of the condition number of A, and forming A^T A leads to loss of information.
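
A quick numerical illustration of this warning; the nearly collinear matrix below is a hypothetical example of ours, not data from the slides:

A = [1 1; 1 1+1e-6; 1 1-1e-6];   % columns are almost linearly dependent
cond(A)                          % on the order of 10^6
cond(A' * A)                     % roughly the square, on the order of 10^12
% Solving the normal equations works with the squared condition number;
% QR-based methods (next slides) work with cond(A) itself.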

The Thin QR Decompositions. Let A ∈ R^{m×n} be a matrix such that m ≥ n and rank(A) = n (a full-rank matrix). A can be factored as

A = Q ( R         )
      ( O_{m-n,n} ),

where Q ∈ R^{m×m} and R ∈ R^{n×n} are such that
(i) Q is an orthonormal matrix, and
(ii) R = (r_ij) is an upper triangular matrix.

LSQ Approximation and QR Decomposition.

Au - b = Q ( R         ) u - b
           ( O_{m-n,n} )

       = Q ( R         ) u - Q Q^T b    (because Q is orthonormal, so Q Q^T = I_m)
           ( O_{m-n,n} )

       = Q ( ( R         ) u - Q^T b ).
             ( O_{m-n,n} )

LSQ Approximation and QR Decomposition (cont'd).

||Au - b||_2^2 = || ( R         ) u - Q^T b ||_2^2.
                    ( O_{m-n,n} )

If we write Q = (L_1  L_2), where L_1 ∈ R^{m×n} and L_2 ∈ R^{m×(m-n)}, then

||Au - b||_2^2 = || ( R         ) u - ( L_1^T b ) ||_2^2
                    ( O_{m-n,n} )     ( L_2^T b )

               = || ( R u - L_1^T b ) ||_2^2
                    ( -L_2^T b      )

               = ||R u - L_1^T b||_2^2 + ||L_2^T b||_2^2.

LSQ Approximation and QR Decomposition (cont'd). The system R u = L_1^T b can be solved by back substitution, and its solution minimizes ||Au - b||_2.

Example: [Q, R] = qr(A, 0), where

    ( 1  1 )            ( 20 )
    ( 1  2 )            ( 18 )
A = ( 1  3 )   and  b = ( 25 )
    ( 1  4 )            ( 28 )
    ( 1  5 )            ( 27 )
    ( 1  6 )            ( 30 )

    ( 0.4082  -0.5976 )
    ( 0.4082  -0.3586 )
Q = ( 0.4082  -0.1195 )   and   R = ( 2.4495  8.5732 )
    ( 0.4082   0.1195 )             ( 0       4.1833 )
    ( 0.4082   0.3586 )
    ( 0.4082   0.5976 )

x = R \ (Q' * b)

x = ( 16.6667 )        cond(A) = 9.3594
    (  2.2857 )
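
The same computation as a self-contained MATLAB/Octave sketch; only the standard functions qr, cond and the backslash operator are used:

A = [ones(6,1) (1:6)'];          % the 6 x 2 matrix of the example
b = [20; 18; 25; 28; 27; 30];
[Q, R] = qr(A, 0);               % thin QR: Q is 6 x 2, R is 2 x 2
x = R \ (Q' * b);                % solves R*u = Q'*b
% x should be approximately [16.6667; 2.2857]; the signs of Q and R may
% differ from those displayed above, but the product Q*R and the solution x do not.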

Sample Data Sets

          V_1   ...  V_j   ...  V_p
E_1  x_1  x_11  ...  x_1j  ...  x_1p
...
E_i  x_i  x_i1  ...  x_ij  ...  x_ip
...
E_n  x_n  x_n1  ...  x_nj  ...  x_np

Sample Matrix

    ( x_1 )
X = (  ⋮  ) ∈ R^{n×p}.
    ( x_n )

Sample Mean: x̄ = (1/n)(x_1 + ··· + x_n) = (1/n) 1^T X.

Centered Sample Matrix

     ( x_1 - x̄ )
X̂ =  (    ⋮    )
     ( x_n - x̄ )

Covariance Matrix: cov(X) = (1/(n-1)) X̂^T X̂ ∈ R^{p×p}, with

cov(X)_ii = (1/(n-1)) Σ_{k=1}^{n} ((v_i)_k)^2 = (1/(n-1)) Σ_{k=1}^{n} (x_ki - x̄_i)^2

and, for i ≠ j,

cov(X)_ij = (1/(n-1)) ( Σ_{k=1}^{n} x_ki x_kj - n x̄_i x̄_j ).

Here c_ii is the i-th variance and c_ij is the (i, j)-covariance.
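
A small MATLAB/Octave check of these formulas; Xc is our name for the centered matrix, and the data are the repair-time/cost sample used in a later example:

X = [15 20; 18 25; 20 20; 35 25; 35 35; 45 20; 45 40; 50 25; 50 35; 60 40];
n = size(X, 1);
xbar = mean(X);                  % sample mean (row vector)
Xc = X - ones(n, 1) * xbar;      % centered sample matrix
C = (Xc' * Xc) / (n - 1);        % covariance matrix, p x p
% C should agree with the built-in cov(X), which uses the same 1/(n-1) normalization.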

Total Variance. The total variance tvar(X) of X is trace(cov(X)). Let X = (x_1; ...; x_n) ∈ R^{n×p} be a centered sample matrix and let R ∈ R^{p×p} be an orthonormal matrix. If Z ∈ R^{n×p} is the matrix Z = XR, then Z is centered, cov(Z) = R^T cov(X) R, and tvar(Z) = tvar(X).

Properties of the Covariance Matrix. cov(X) = (1/(n-1)) X^T X is symmetric, so it is orthonormally diagonalizable: there exists an orthonormal matrix R such that R^T cov(X) R = D, which corresponds to a sample matrix Z = XR. Since

    ( z_1 )   ( x_1 )
Z = (  ⋮  ) = (  ⋮  ) R,
    ( z_n )   ( x_n )

the new data have the form z_i = x_i R.

Properties of the Diagonalized Covariance Matrix. If cov(Z) = D = diag(d_1, ..., d_p), then d_i is the variance of the i-th component, and the covariances of the form cov(i, j) with i ≠ j are 0. From a statistical point of view, this means that the components i and j are uncorrelated. The principal components of the sample matrix X are the eigenvectors of the matrix cov(X): the columns of the matrix R that diagonalizes cov(X).

Variance Explanation. The sum of the elements on D's main diagonal equals the total variance tvar(X). The principal components explain the sources of the total variance: the grouping of the sample vectors around the first principal component p_1 explains the largest portion of the variance; their grouping around p_2 explains the second-largest portion of the variance, etc.

Example

           minutes  cost
E_1   x_1    15      20
E_2   x_2    18      25
E_3   x_3    20      20
E_4   x_4    35      25
E_5   x_5    35      35
E_6   x_6    45      20
E_7   x_7    45      40
E_8   x_8    50      25
E_9   x_9    50      35
E_10  x_10   60      40

Price vs. Repair Time: [scatter plot "Cost vs. time" of cost against minutes for the data above].

The principal components of X:

r_1 = (  0.37 )    and    r_2 = ( 0.93 ),
      ( -0.93 )                 ( 0.37 )

corresponding to the eigenvalues λ_1 = 35.31 and λ_2 = 268.98.
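
A hedged MATLAB/Octave sketch reproducing these numbers with the standard eigensolver eig; note that the sign of each eigenvector column is arbitrary:

X = [15 20; 18 25; 20 20; 35 25; 35 35; 45 20; 45 40; 50 25; 50 35; 60 40];
[R, D] = eig(cov(X));            % columns of R are the principal components
diag(D)                          % approximately [35.31; 268.98] (eig typically returns
                                 % ascending eigenvalues for a symmetric matrix)
R                                % approximately [0.37 0.93; -0.93 0.37], up to column signs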

Centered Data: [scatter plot "Cost vs. time (centered data)" of centered cost against centered minutes].

The New Variables

z_1 = 0.37 (minutes - 37.30) - 0.93 (price - 28.50),
z_2 = 0.93 (minutes - 37.30) + 0.37 (price - 28.50).

The New Variables: [scatter plot of the sample in the new coordinates z_1 and z_2].

The Optimality Theorem of PCA. Let X ∈ R^{n×p} be a centered sample matrix and let R ∈ R^{p×p} be the orthonormal matrix such that R^T cov(X) R is a diagonal matrix D ∈ R^{p×p} with d_11 ≥ ··· ≥ d_pp. Let Q ∈ R^{p×l} be a matrix whose set of columns is orthonormal, where 1 ≤ l ≤ p, and let W = XQ ∈ R^{n×l}. Then trace(cov(W)) is maximized when Q consists of the first l columns of R and is minimized when Q consists of the last l columns of R.
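
A small numerical illustration of the theorem for l = 1, again with the repair-time data; since cov centers its argument, X need not be centered explicitly here:

X = [15 20; 18 25; 20 20; 35 25; 35 35; 45 20; 45 40; 50 25; 50 35; 60 40];
[R, D] = eig(cov(X));            % eigenvalues typically in ascending order
trace(cov(X * R(:, end)))        % projection on the top eigenvector: about 268.98 (maximum)
trace(cov(X * R(:, 1)))          % projection on the bottom eigenvector: about 35.31 (minimum)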

Singular Values and Vectors. Let A ∈ C^{m×n} be a matrix. A number σ ∈ R_{>0} is a singular value of A if there exists a pair of vectors (u, v) ∈ C^m × C^n such that Av = σu and A^H u = σv. The vector u is the left singular vector and v is the right singular vector associated to the singular value σ.

SVD Facts. The matrices A^H A and A A^H have the same non-zero eigenvalues. If σ is a singular value of A, then σ^2 is an eigenvalue of both A^H A and A A^H. Any right singular vector v is an eigenvector of A^H A and any left singular vector u is an eigenvector of A A^H.

SVD Facts (cont'd). If λ is an eigenvalue of A^H A and A A^H, then λ is a real non-negative number. There is v ∈ C^n such that A^H A v = λ v. Let u ∈ C^m be the vector defined by Av = √λ u. We have A^H u = √λ v, so √λ is a singular value of A and (u, v) is a pair of singular vectors associated with the singular value √λ. If A ∈ C^{n×n} is invertible and σ is a singular value of A, then 1/σ is a singular value of the matrix A^{-1}.
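
These facts are easy to check numerically; the 2 x 3 matrix below is a hypothetical example of ours:

A  = [1 0 1; 0 1 2];
s  = svd(A);                          % singular values of A
e1 = sort(eig(A * A'), 'descend');    % eigenvalues of A A^H
e2 = sort(eig(A' * A), 'descend');    % eigenvalues of A^H A (one extra zero)
% Up to rounding, e1 and the leading entries of e2 both equal s.^2.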

Example: the SVD of a Vector. Let a = (a_1, ..., a_n)^T be a non-zero vector in C^n, which can also be regarded as a matrix in C^{n×1}. The square of a singular value of a is an eigenvalue of the matrix

        ( a_1 ā_1  ...  a_1 ā_n )
a a^H = (    ⋮              ⋮   )
        ( a_n ā_1  ...  a_n ā_n ).

The unique non-zero eigenvalue of this matrix is ||a||_2^2, so the unique singular value of a is ||a||_2.

A Decomposition of Square Matrices. Let A ∈ C^{n×n} be a unitarily diagonalizable matrix. There exist a diagonal matrix D ∈ C^{n×n} and a unitary matrix U ∈ C^{n×n} such that A = U^H D U; equivalently, we have A = V D V^H, where V = U^H. If V = (v_1 ··· v_n), then A v_i = d_i v_i, where D = diag(d_1, ..., d_n), so v_i is a unit eigenvector of A that corresponds to the eigenvalue d_i. This implies

                  ( d_1 v_1^H )
A = (v_1 ··· v_n) (     ⋮     ) = d_1 v_1 v_1^H + ··· + d_n v_n v_n^H.
                  ( d_n v_n^H )

Decomposition of Rectangular Matrices.

Theorem (SVD Decomposition Theorem). Let A ∈ C^{m×n} be a matrix with singular values σ_1, σ_2, ..., σ_p, where σ_1 ≥ σ_2 ≥ ··· ≥ σ_p > 0 and p ≤ min{m, n}. There exist two unitary matrices U ∈ C^{m×m} and V ∈ C^{n×n} such that

A = U diag(σ_1, ..., σ_p) V^H,    (1)

where diag(σ_1, ..., σ_p) ∈ R^{m×n} (the p × p diagonal block padded with zeros to size m × n).

SVD Properties. Let A = U diag(σ_1, ..., σ_p) V^H be the SVD decomposition of the matrix A, where σ_1, ..., σ_p are the singular values of A. The rank of A equals p, the number of non-zero elements located on the diagonal of the factor diag(σ_1, ..., σ_p).

The Thin SVD Decomposition. Let A ∈ C^{m×n} be a matrix with singular values σ_1, σ_2, ..., σ_p, where σ_1 ≥ σ_2 ≥ ··· ≥ σ_p > 0 and p ≤ min{m, n}. Then A can be factored as A = U D V^H, where U ∈ C^{m×p} and V ∈ C^{n×p} are matrices having orthonormal columns and D is the diagonal matrix

D = ( σ_1  0    ...  0   )
    ( 0    σ_2  ...  0   )
    ( ...                )
    ( 0    0    ...  σ_p ).
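
In MATLAB/Octave the thin factorization is available directly; the random matrix below is just a placeholder:

A = rand(6, 3);                  % any tall full-rank matrix (placeholder data)
[U, S, V] = svd(A, 'econ');      % thin SVD: U is 6 x 3, S is 3 x 3 diagonal, V is 3 x 3
norm(A - U * S * V', 'fro')      % should be close to machine precision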

SVD Facts. The rank-1 matrices of the form u_i v_i^H, where 1 ≤ i ≤ p, are pairwise orthogonal, and ||u_i v_i^H||_F = 1 for 1 ≤ i ≤ p.

Eckart-Young Theorem. Let A ∈ C^{m×n} be a matrix whose sequence of non-zero singular values is σ_1 ≥ ··· ≥ σ_p > 0, so that A can be written as

A = σ_1 u_1 v_1^H + ··· + σ_p u_p v_p^H.

Let B(k) ∈ C^{m×n} be

B(k) = Σ_{i=1}^{k} σ_i u_i v_i^H.

If r_k = inf{ ||A - X||_2 : X ∈ C^{m×n} and rank(X) ≤ k }, then

||A - B(k)||_2 = r_k = σ_{k+1}

for 1 ≤ k ≤ p, where σ_{p+1} = 0.
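
A minimal MATLAB/Octave sketch of the best rank-k approximation and of the error identity above; the matrix A and the value of k are placeholders of ours:

A = magic(5);                               % placeholder matrix
k = 2;
[U, S, V] = svd(A);
Bk = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';  % B(k), the sum of the first k rank-1 terms
s = diag(S);
norm(A - Bk, 2)                             % equals s(k+1) = sigma_{k+1}, up to rounding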

Central Issue in IR: the computation of sets of documents that contain terms specified by queries submitted by users. Challenges: a concept can be expressed by many equivalent words (synonymy), and the same word may mean different things in various contexts (polysemy); this can lead the retrieval technique to return documents that are irrelevant to the query (false positives) or to omit documents that may be relevant (false negatives).

Documents, Corpora, Terms. A corpus is a pair K = (T, D), where T = {t_1, ..., t_m} is a finite set whose elements are referred to as terms, and D = (D_1, ..., D_n) is a set of documents. Each document D_j is a finite sequence of terms, D_j = (t_{j1}, ..., t_{jk}, ..., t_{j l_j}).

Documents, Corpora, Terms (cont'd). Each term t_i generates a row vector (a_{i1}, a_{i2}, ..., a_{in}), referred to as a term vector, and each document D_j generates a column vector d_j = (a_{1j}, ..., a_{mj})^T. A query is a sequence of terms q ∈ Seq(T), and it is also represented as a vector q = (q_1, ..., q_m)^T, where q_i = 1 if the term t_i occurs in q, and 0 otherwise.

Retrieval. When a query q is applied to a corpus K, an IR system based on the vector model computes the similarity between the query and the documents of the corpus by evaluating the cosine of the angle between the query vector q and the vectors of the documents of the corpus. For the angle α_j between q and d_j we have

cos α_j = (q, d_j) / (||q||_2 ||d_j||_2).

The IR system returns those documents D_j for which this angle is small, that is, cos α_j ≥ t, where t is a parameter provided by the user.

LSI. The LSI method aims to capture relationships between documents motivated by the underlying structure of the documents. This structure is obscured by synonymy, polysemy, the use of insignificant syntactic-sugar words, and plain noise, which is caused by misspelled words or counting errors.

Assumptions:
(i) A ∈ R^{m×n} is the matrix of a corpus K;
(ii) A = U D V^H is an SVD of A;
(iii) rank(A) = p.

Then the first p columns of U form an orthonormal basis for range(A), the subspace generated by the document vectors of K; the last n - p columns of V constitute an orthonormal basis for nullsp(A); and the first p transposed columns of V form an orthonormal basis for the subspace of R^n generated by the term vectors of K.

Example. K = (T, D), where T = {t_1, ..., t_5} and D = {D_1, D_2, D_3}, with

    ( 1  0  0 )
    ( 0  1  0 )
A = ( 1  1  1 )
    ( 1  1  0 )
    ( 0  0  1 )

D_1 and D_2 are fairly similar (they contain two common terms, t_3 and t_4).

Example (cont'd). Suppose that t_1 and t_2 are synonyms. The query q = (1, 0, 0, 0, 0)^T returns D_1. The matrix representation cannot directly account for the equivalence of the terms t_1 and t_2. However, this is an acceptable assumption because these terms appear in the common context {t_3, t_4} in D_1 and D_2.

An SVD of A: A = U D V^H, where

U = ( 0.2787   0.2176   0.7071  -0.5314   0.3044 )
    ( 0.2787   0.2176  -0.7071  -0.5314   0.3044 )
    ( 0.7138  -0.3398   0.0000  -0.1099  -0.6024 )
    ( 0.5573   0.4352   0.0000   0.6412   0.2980 )
    ( 0.1565  -0.7749   0.0000   0.1099   0.6024 ),

D = ( 2.3583   0        0      )
    ( 0        1.1994   0      )
    ( 0        0        1.0000 )
    ( 0        0        0      )
    ( 0        0        0      ),

V^H = ( 0.6572   0.6572   0.3690 )
      ( 0.2610   0.2610  -0.9294 )
      ( 0.7071  -0.7071   0.0000 ).

Successive Approximations of A:

B(1) = σ_1 u_1 v_1^H
       ( 0.4319  0.4319  0.2425 )
       ( 0.4319  0.4319  0.2425 )
     = ( 1.1063  1.1063  0.6213 )
       ( 0.8638  0.8638  0.4851 )
       ( 0.2425  0.2425  0.1362 ),

B(2) = σ_1 u_1 v_1^H + σ_2 u_2 v_2^H
       ( 0.5000  0.5000  0.0000 )
       ( 0.5000  0.5000  0.0000 )
     = ( 1.0000  1.0000  1.0000 )
       ( 1.0000  1.0000  0.0000 )
       ( 0.0000  0.0000  1.0000 ).

A and B(1)

    ( 1  0  0 )          ( 0.4319  0.4319  0.2425 )
    ( 0  1  0 )          ( 0.4319  0.4319  0.2425 )
A = ( 1  1  1 )   B(1) = ( 1.1063  1.1063  0.6213 )
    ( 1  1  0 )          ( 0.8638  0.8638  0.4851 )
    ( 0  0  1 )          ( 0.2425  0.2425  0.1362 )

A and B(2)

    ( 1  0  0 )          ( 0.5000  0.5000  0.0000 )
    ( 0  1  0 )          ( 0.5000  0.5000  0.0000 )
A = ( 1  1  1 )   B(2) = ( 1.0000  1.0000  1.0000 )
    ( 1  1  0 )          ( 1.0000  1.0000  0.0000 )
    ( 0  0  1 )          ( 0.0000  0.0000  1.0000 )

Basic Properties of the SVD Approximation:
(i) the rank-1 matrices u_i v_i^H are pairwise orthogonal;
(ii) their Frobenius norms are all equal to 1;
(iii) noise is distributed with relative uniformity with respect to the p orthonormal components of the SVD.

By omitting several such components that correspond to relatively small singular values, we eliminate a substantial part of the noise and obtain a matrix that better reflects the underlying hidden structure of the corpus.

Example. Let q be a query whose vector is q = (1, 0, 0, 1, 0)^T. The similarity between q and the document vectors d_i, 1 ≤ i ≤ 3, that constitute the columns of A is cos(q, d_1) = 0.8165, cos(q, d_2) = 0.4082, and cos(q, d_3) = 0, suggesting that d_1 is by far the most relevant document for q.

Example (cont'd). If we compute the same cosine for q and the columns b_1, b_2, and b_3 of the matrix B(2), we have cos(q, b_1) = cos(q, b_2) = 0.6708 and cos(q, b_3) = 0. This approximation of A uncovers the hidden similarity of D_1 and D_2, a fact that is quite apparent from the structure of the matrix B(2).
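
The whole comparison can be reproduced with a short MATLAB/Octave sketch; the helper name cosine is our addition:

A = [1 0 0; 0 1 0; 1 1 1; 1 1 0; 0 0 1];     % term-document matrix of the example
q = [1; 0; 0; 1; 0];                          % the query vector
[U, S, V] = svd(A);
B2 = U(:, 1:2) * S(1:2, 1:2) * V(:, 1:2)';    % rank-2 approximation B(2)
cosine = @(x, y) (x' * y) / (norm(x) * norm(y));
[cosine(q, A(:,1))  cosine(q, A(:,2))  cosine(q, A(:,3))]    % about [0.8165 0.4082 0]
[cosine(q, B2(:,1)) cosine(q, B2(:,2)) cosine(q, B2(:,3))]   % about [0.6708 0.6708 0]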

Recognizing Chinese Characters - N. Wang

Characteristics of Chinese Characters: square in shape; more than 5000 characters; the set of characters is indexed by the number of strokes, which varies from 1 to 32.

Difficulties with Stroke Counts: the length of a stroke is variable; the number of strokes does not capture the topology of a character; strokes are not line segments; rather, they are calligraphic units.

Two Characters with 9 Strokes Each

Characters are Digitized (black and white): [image: a matrix of digitized Chinese characters].

Successive Approximation of Images: SVD for Chinese characters; [images of] the best rank-k approximations to A.
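
A generic MATLAB/Octave sketch of this idea for a grayscale image stored as a matrix; the file name and the value of k are placeholders of ours:

A = double(imread('character.png'));        % hypothetical grayscale image of a character
[U, S, V] = svd(A);
k = 20;                                     % number of singular values to keep
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';  % best rank-k approximation (Eckart-Young)
imagesc(Ak); colormap(gray);                % display the compressed character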

Scree Diagram: Variation of the Singular Values. [plot of the singular values]

Several Important Sources for Methods:
G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, 1989.
I. T. Jolliffe, Principal Component Analysis, Springer, New York, 2002.
R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, Cambridge, UK, 1996.
L. Eldén, Matrix Methods in Data Mining and Pattern Recognition, SIAM, Philadelphia, 2007.