BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics

Lecture 14: Dimension Reduction
Jie Wang, Department of Computational Medicine & Bioinformatics, University of Michigan

Outline
What is feature reduction?
Why feature reduction?
Feature reduction algorithms
Principal Component Analysis (PCA)

What is feature reduction?
Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space.
The criterion for feature reduction can be different based on different problem settings.
Unsupervised setting: minimize the information loss.
Supervised setting: maximize the class discrimination.
Given a set of data points $x_1, x_2, \ldots, x_n$ with $p$ variables, we compute a linear transformation $G \in \mathbb{R}^{p \times d}$: $x \in \mathbb{R}^p \mapsto y = G^T x \in \mathbb{R}^d$, $d \ll p$.

What is feature reduction?
The original data $x \in \mathbb{R}^p$ is mapped by the linear transformation $G^T \in \mathbb{R}^{d \times p}$ to the reduced data $y \in \mathbb{R}^d$:
$G \in \mathbb{R}^{p \times d}$: $x \in \mathbb{R}^p \mapsto y = G^T x \in \mathbb{R}^d$, $d \ll p$.
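
As a quick concrete illustration of this mapping, here is a minimal numpy sketch; the matrix G below is just a random placeholder (in practice a method such as PCA would learn it from the data):

```python
import numpy as np

# Toy illustration of y = G^T x with x in R^p, G in R^{p x d}, d << p.
rng = np.random.default_rng(0)
p, d, n = 50, 3, 100            # original dimension, reduced dimension, number of samples

X = rng.normal(size=(n, p))     # rows are the data points x_1, ..., x_n
G = rng.normal(size=(p, d))     # placeholder transformation; PCA would choose G from the data

Y = X @ G                       # row i is y_i = G^T x_i in R^d
print(X.shape, "->", Y.shape)   # (100, 50) -> (100, 3)
```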

High-dimensional data
Examples shown in the slide: gene expression, face images, handwritten digits.

Feature reduction vs. feature selection
Feature reduction: we create a small number of new features based on the full set of original features; the transformed features can be linear combinations of the original ones.
Feature selection: we only make use of a subset of the original features (sparse models).
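
To make the contrast concrete, here is a small numpy sketch (the weight matrix and the selected indices are hypothetical placeholders): feature reduction builds new features as linear combinations of all original ones, while feature selection keeps a subset of the original columns unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))        # 100 samples, 50 original features

# Feature reduction: every new feature mixes all original features.
G = rng.normal(size=(50, 3))          # placeholder weights; PCA/LDA would learn these
X_reduced = X @ G                     # shape (100, 3), linear combinations of all 50 features

# Feature selection: keep a subset of the original features unchanged.
selected = [0, 7, 23]                 # hypothetical indices chosen by some selection method
X_selected = X[:, selected]           # shape (100, 3), original features only
```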

Why feature reduction?
Most machine learning and data mining techniques may not be effective for high-dimensional data (curse of dimensionality).
The intrinsic dimension may be small; for example, the number of genes responsible for a certain type of disease may be small.

Why feature reduction?
Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage and retrieval.
Feature extraction: synthesis of a smaller set of new features that are hopefully more useful.

Feature reduction algorithms
Unsupervised: independent component analysis (ICA), principal component analysis (PCA), canonical correlation analysis (CCA).
Supervised: linear discriminant analysis (LDA).
Semi-supervised: an active research topic.

Feature reduction algorithms
Linear: principal component analysis (PCA), linear discriminant analysis (LDA), canonical correlation analysis (CCA).
Nonlinear: nonlinear feature reduction using kernels, manifold learning.

Principal Component Analysis
Principal component analysis (PCA) reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables, that retains most of the sample's information.
It is useful for the compression and classification of data.
By information, we mean the variance present in the sample.
The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.

How to derive PCs
Maximize the variance of the data samples that are projected into a subspace with a fixed dimension.
Minimize the reconstruction error.
(These two views lead to the same principal components.)

Sample variance
Given samples $x_1, x_2, \ldots, x_n$, the sample average is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
The sample variance is $\frac{1}{n}\sum_{i=1}^{n} \|x_i - \bar{x}\|_2^2$.
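
A short numpy check of these formulas, assuming the data points are stored as the rows of a matrix X:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                            # rows are x_1, ..., x_n

x_bar = X.mean(axis=0)                                   # sample average (1/n) sum_i x_i
sample_var = np.mean(np.sum((X - x_bar) ** 2, axis=1))   # (1/n) sum_i ||x_i - x_bar||_2^2
print(sample_var)
```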

The first principal component
The first principal component is the vector $p_1$, with $\|p_1\| = 1$, such that the variance of the data projected onto $p_1$ is maximized.
The projection of $x_i$ onto $p_1$ is $z_i = \langle p_1, x_i \rangle \, p_1$ (for projections, see G. Strang, Lectures 15 & 16, http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/index.htm).
The variance of the projected data onto $p_1$ is
$\frac{1}{n}\sum_{i=1}^{n} \|z_i - \bar{z}\|_2^2 = \frac{1}{n}\sum_{i=1}^{n} \|\langle p_1, x_i \rangle p_1 - \langle p_1, \bar{x} \rangle p_1\|_2^2 = \frac{1}{n}\sum_{i=1}^{n} \langle p_1, x_i - \bar{x} \rangle^2 \, \|p_1\|_2^2$,
and since $\|p_1\|_2^2 = 1$, this equals $\frac{1}{n}\sum_{i=1}^{n} \langle p_1, x_i - \bar{x} \rangle^2$.

The first principal component
The variance of the projected data onto $p_1$ is
$\frac{1}{n}\sum_{i=1}^{n} \langle p_1, x_i - \bar{x} \rangle^2 = \frac{1}{n}\sum_{i=1}^{n} p_1^T (x_i - \bar{x})(x_i - \bar{x})^T p_1 = p_1^T \left( \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T \right) p_1 = p_1^T S p_1$,
where $S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$ is the (symmetric) covariance matrix.

The first principal component
Finding the first principal component boils down to the following optimization problem:
$\max_{p_1 : \|p_1\| = 1} \; p_1^T S p_1$.
We solve this problem with a Lagrange multiplier (see https://www.cs.berkeley.edu/~klein/papers/lagrange-multipliers.pdf).
Lagrange function: $L(p_1, \lambda_1) = p_1^T S p_1 + \lambda_1 (1 - p_1^T p_1)$.
Optimality condition: $\nabla_{p_1} L(p_1, \lambda_1) = 2 S p_1 - 2 \lambda_1 p_1 = 0$, so $S p_1 = \lambda_1 p_1$, i.e., $p_1$ is an eigenvector of $S$.
Then $p_1^T S p_1 = \lambda_1 p_1^T p_1 = \lambda_1$, so to maximize the variance, $\lambda_1$ must be the largest eigenvalue of $S$.
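
A minimal numpy sketch of this result on synthetic data: the first principal component is the eigenvector of the sample covariance matrix S with the largest eigenvalue, and the variance of the projected data equals that eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])   # anisotropic toy data

Xc = X - X.mean(axis=0)                       # center the data
S = (Xc.T @ Xc) / X.shape[0]                  # S = (1/n) sum_i (x_i - x_bar)(x_i - x_bar)^T

eigvals, eigvecs = np.linalg.eigh(S)          # S is symmetric; eigenvalues in ascending order
p1, lambda1 = eigvecs[:, -1], eigvals[-1]     # eigenvector with the largest eigenvalue

# The variance of the data projected onto p1 equals lambda1 (up to numerical error).
print(lambda1, np.mean((Xc @ p1) ** 2))
```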

The first principal component
Figure: projection of two-dimensional data using one-dimensional PCA. The figure is from D. Barber, Bayesian Reasoning and Machine Learning.

The second principal component
The second principal component is the vector $p_2$ such that $p_2$ is perpendicular to $p_1$ and the variance of the data projected onto $p_2$ is maximized:
$\max_{p_1^T p_2 = 0, \, \|p_2\| = 1} \; p_2^T S p_2$.
Lagrange function: $L(p_2, \lambda_2, \mu) = p_2^T S p_2 + \lambda_2 (1 - p_2^T p_2) - \mu \, p_1^T p_2$.
Optimality condition: $\nabla_{p_2} L(p_2, \lambda_2, \mu) = 2 S p_2 - 2 \lambda_2 p_2 - \mu p_1 = 0$.
Left-multiplying by $p_1^T$ gives $2 p_1^T S p_2 - 2 \lambda_2 p_1^T p_2 - \mu p_1^T p_1 = 0$, i.e., $2 \lambda_1 p_1^T p_2 - 2 \lambda_2 p_1^T p_2 - \mu p_1^T p_1 = 0$ (using $S p_1 = \lambda_1 p_1$ and the symmetry of $S$), so $\mu = 0$ since $p_1^T p_2 = 0$ and $p_1^T p_1 = 1$.
Therefore $S p_2 = \lambda_2 p_2$ and $p_2^T S p_2 = \lambda_2 p_2^T p_2 = \lambda_2$: $p_2$ is an eigenvector of $S$, and $\lambda_2$ is the second largest eigenvalue of $S$.
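
Continuing on the same kind of synthetic data, a short sketch verifying that $p_2$ is orthogonal to $p_1$ and captures the second largest eigenvalue; more generally, the top $d$ eigenvectors of $S$ give the transformation $G$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])   # same toy data as above
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / X.shape[0]                  # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)          # ascending eigenvalues; S is symmetric
p1, p2 = eigvecs[:, -1], eigvecs[:, -2]       # top two eigenvectors

print(np.dot(p1, p2))                         # ~0: p2 is perpendicular to p1
print(eigvals[-2], np.mean((Xc @ p2) ** 2))   # second largest eigenvalue ~ variance along p2

# More generally, the top d eigenvectors form the transformation G in R^{p x d};
# the columns of the reduced data Y are uncorrelated.
d = 2
G = eigvecs[:, ::-1][:, :d]
Y = Xc @ G
```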

PCA for image compression
Figure: reconstructions of an image using d = 1, 2, 4, 8, 16, 32, 64, and 100 principal components, compared with the original image.
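
The following sketch illustrates the idea behind this slide on a synthetic grayscale "image" (a random low-rank matrix standing in for a real photo): treat the rows as data points, keep the top d principal components, and reconstruct.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic low-rank "image" standing in for a real grayscale photo.
img = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 64))

row_mean = img.mean(axis=0)
C = img - row_mean                           # treat the image rows as data points
S = (C.T @ C) / C.shape[0]                   # covariance across pixel columns
eigvals, eigvecs = np.linalg.eigh(S)

for d in (1, 2, 4, 8, 16, 32):
    G = eigvecs[:, ::-1][:, :d]              # top-d principal components
    recon = (C @ G) @ G.T + row_mean         # project onto d PCs, then map back
    err = np.linalg.norm(img - recon) / np.linalg.norm(img)
    print(f"d={d:2d}  relative reconstruction error: {err:.3f}")
```

Because the synthetic image has rank 8, the error drops to numerical noise once d reaches 8, mirroring how the reconstructions in the slide approach the original image as d grows.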

References
C. Burges. Dimension Reduction: A Guided Tour. Foundations and Trends in Machine Learning, 2009.
J. Cunningham and Z. Ghahramani. Linear Dimensionality Reduction: Survey, Insights, and Generalizations. Journal of Machine Learning Research, 2015.