Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning
Recitation 11, Fall Semester
Lecturer: Regev Schweiger
Scribe: Regev Schweiger

11.1 Kernel Ridge Regression

We now take on the task of kernel-izing ridge regression. Let x_1, ..., x_n ∈ R^d, and y ∈ R^n. Recall that ridge regression solves the following problem:

    \hat{a} = \arg\min_{a \in \mathbb{R}^d} \|y - Xa\|^2 + \lambda \|a\|^2,

where λ is the penalty coefficient. Equating the gradient to 0 results in the solution we have seen in class:

    \hat{a} = (X^T X + \lambda I_d)^{-1} X^T y.

Note that

    (X^T X + \lambda I_d) X^T = X^T X X^T + \lambda X^T = X^T (X X^T + \lambda I_n).

Multiplying by (X^T X + λI_d)^{-1} on the left and by (XX^T + λI_n)^{-1} on the right, we get

    X^T (X X^T + \lambda I_n)^{-1} = (X^T X + \lambda I_d)^{-1} X^T.

Therefore, the optimal solution is, equivalently,

    \hat{a} = X^T (X X^T + \lambda I_n)^{-1} y.

Given a new point x, our regression estimate will be

    x^T \hat{a} = x^T X^T (X X^T + \lambda I_n)^{-1} y.

We would now like to embed our points in a space H, with x_i ↦ φ(x_i), and perform ridge regression after the transformation. It is easy to see that, given the formulation above, we can replace all the expressions involving X with kernel expressions. First define K as the matrix for which K_{i,j} = K(x_i, x_j). Similarly, define k as the vector for which k_i = φ(x)^T φ(x_i) = K(x, x_i). Thus, given a new point x, our regression estimate will be

    \varphi(x)^T \hat{a} = k^T (K + \lambda I_n)^{-1} y.

Note that, as usual, we cannot write down \hat{a} explicitly, but we can apply it to the transformation of new points.
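To make the derivation concrete, here is a minimal NumPy sketch of kernel ridge regression in its dual form: rather than forming \hat{a}, we compute α = (K + λI_n)^{-1} y once and predict a new point with k^T α. The RBF kernel and the values of gamma and lam are illustrative assumptions, not part of the recitation.

```python
# Minimal sketch of dual (kernelized) ridge regression; kernel and parameters are assumptions.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs of rows of A and B.
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-gamma * sq_dists)

def kernel_ridge_fit(X, y, lam, gamma=1.0):
    # Dual coefficients: alpha = (K + lam * I_n)^{-1} y.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(n), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    # Prediction for a new point x: k^T alpha, where k_i = K(x, x_i).
    k = rbf_kernel(X_new, X_train, gamma)   # shape (n_new, n_train)
    return k @ alpha

# Usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = kernel_ridge_fit(X, y, lam=0.5)
y_hat = kernel_ridge_predict(X, alpha, X[:5])
```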

2 Lecture.2 PCA as axiizing variance We have seen how the PCA algorith can be derived in the context of iniizing the reconstruction error. More forally, assue we have a set of input vectors x,..., x, where x i R d. Denote the principal coponents by the coluns of V, as v,..., v r ; the orthonorality constraints iply that V T V = I. The PCA proble was: V = arg in V R d r x i V V T x i 2. (.) i= We now consider another possible criterion. Let s assue r =, that is, we would like to find the best line in soe sense. One intuitive criterion is the line, which if we project all points on, will give axial epirical variance. The epirical variance of a set of easureents, a,..., a, is (a i i= a j ) 2 j= Assue without loss of generality that the data points are centered at zero, that is x i = 0 i= If that is not the case, we ean-center the data. Therefore, it is easy to say that i= vt x i = 0 for each v. Therefore, the epirical variance of the set of projection is siply the ean of squares. Therefore, the criterion we like for the first direction is: v = arg ax v = (v T x i ) 2 For the next direction, we would like to capture the variance on directions we have not yet seen. Forally, we would like directions orthogonal to previous directions. Assue we found already v,..., v r. Then, the r-th direction is: i= v r = argax v =,v v,...,v r (v T x i ) 2 We can instead forulate that to find all r directions together, to get: i= argax V R d r,v T V =I r j= (vj T x i ) 2 i=

It is easy to see that the objective function satisfies:

    \sum_{j=1}^{r} \frac{1}{m} \sum_{i=1}^{m} (v_j^T x_i)^2 = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{r} (v_j^T x_i)^2 = \frac{1}{m} \sum_{i=1}^{m} \| V^T x_i \|^2.

To summarize, a sensible criterion for dimensionality reduction would be to choose V so that the variance of the projections is maximized, i.e., intuitively, so that the structure of the data is preserved as much as possible:

    \arg\max_{V \in \mathbb{R}^{d \times r},\; V^T V = I} \frac{1}{m} \sum_{i=1}^{m} \| V^T x_i \|^2.    (11.2)

We note, however, the following equality, based on Pythagoras:

    \| x_i \|^2 = \| V V^T x_i \|^2 + \| x_i - V V^T x_i \|^2,

and it is easy to see that ‖V V^T x_i‖^2 = ‖V^T x_i‖^2 due to the orthonormality of V. Since ‖x_i‖^2 does not depend on V, minimizing the reconstruction error is equivalent to maximizing the variance. The goal in principal component analysis (PCA) is therefore to minimize the reconstruction error (see Equation 11.1) or, equivalently, to maximize the projected variance (Equation 11.2).

Eigenvalues. An important observation is the following. We know that the solution of PCA is given by the eigenvectors of the empirical covariance matrix. What are the eigenvalues? The variance-maximization criterion gives an intuitive interpretation. Let λ_1, ..., λ_d be the eigenvalues of C = (1/m) Σ_{i=1}^m x_i x_i^T, i.e., C v_j = λ_j v_j. We sought to maximize (1/m) Σ_{i=1}^m (v^T x_i)^2. Plugging in v = v_j, we get

    \frac{1}{m} \sum_{i=1}^{m} (v_j^T x_i)^2 = v_j^T \Big( \frac{1}{m} \sum_{i=1}^{m} x_i x_i^T \Big) v_j = v_j^T C v_j = \lambda_j.

That is, λ_j, the j-th eigenvalue, is the empirical variance of the projection on the j-th principal axis.
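Both facts above, that each eigenvalue of C is the empirical variance of the projection on the corresponding eigenvector, and that reconstruction error and projected variance sum to a quantity independent of V, are easy to verify numerically. The following NumPy sketch uses synthetic data with arbitrary (assumed) dimensions, purely as an illustration.

```python
# Numerical check of the eigenvalue/variance interpretation and the Pythagoras identity.
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 4
X = rng.normal(size=(m, d)) * np.array([2.0, 1.0, 0.5, 0.1])  # anisotropic synthetic data
X = X - X.mean(axis=0)                       # mean-center, as assumed above

C = (X.T @ X) / m                            # C = (1/m) sum_i x_i x_i^T
eigvals, V = np.linalg.eigh(C)               # C v_j = lambda_j v_j (ascending eigenvalues)

# Each eigenvalue equals the empirical variance of the projection on its eigenvector.
for j in range(d):
    proj_var_j = np.mean((X @ V[:, j]) ** 2)     # (1/m) sum_i (v_j^T x_i)^2
    assert np.isclose(proj_var_j, eigvals[j])

# Pythagoras: total = reconstruction error + projected variance, for any orthonormal V_r.
r = 2
Vr = V[:, -r:]                               # top-r principal directions
recon_err = np.mean(np.sum((X - X @ Vr @ Vr.T) ** 2, axis=1))
proj_var  = np.mean(np.sum((X @ Vr) ** 2, axis=1))
total     = np.mean(np.sum(X ** 2, axis=1))
assert np.isclose(total, recon_err + proj_var)
```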

11.3 PCA example

11.3.1 Background

The DNA in our cells contains long chains of four chemical building blocks: adenine, thymine, cytosine, and guanine, abbreviated A, T, C, and G. More than 6 billion of these chemical bases, strung together in 23 pairs of chromosomes, exist in a human cell. These genetic sequences contain information that influences our physical traits, our likelihood of suffering from disease, and the responses of our bodies to substances that we encounter in the environment.

The genetic sequences of different people are remarkably similar. When the chromosomes of two humans are compared, their DNA sequences can be identical for hundreds of bases. But at about one in every 1,200 bases, on average, the sequences will differ. Differences in individual bases are by far the most common type of genetic variation. One person might have an A at that location, while another person has a G. These genetic differences are known as single nucleotide polymorphisms, or SNPs (pronounced "snips"). There are approximately 10 million SNPs estimated to occur commonly in the human genome.

Each distinct spelling of a chromosomal region is called an allele, and a collection of alleles in a person's chromosomes is known as a genotype. In the most common case, there are only two alleles in the whole population at each SNP position. Data describing the genotypes of individuals often does not specify the bases explicitly. Instead, one allele (per position) is selected as a reference allele. Then, at that position, the number of non-reference alleles is recorded: 0 if both alleles at that position, in the chromosome pair, were identical to the reference allele for that position; 1 if only one of them was the reference allele; and 2 if neither was the reference allele.

11.3.2 Novembre et al., 2008

In the work of Novembre et al. (2008, Nature), 3,192 European individuals were genotyped at 500,568 positions (some details are omitted for simplicity). They applied PCA with r = 2 and presented the projections of all genomes on these two principal axes:
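As a rough illustration of the pipeline described here (not the actual Novembre et al. data or code), the following NumPy sketch builds a synthetic 0/1/2 genotype matrix, mean-centers each SNP column, and projects every individual onto the top two principal axes; the sample sizes and allele frequencies are made-up assumptions.

```python
# Illustrative sketch of genotype PCA with r = 2 on synthetic 0/1/2 data.
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_snps = 200, 1000
allele_freqs = rng.uniform(0.05, 0.5, size=n_snps)

# Each entry counts non-reference alleles at a SNP for one individual: 0, 1, or 2.
G = rng.binomial(2, allele_freqs, size=(n_individuals, n_snps)).astype(float)

# Mean-center each SNP column, then take the two leading eigenvectors of the covariance.
G -= G.mean(axis=0)
C = (G.T @ G) / n_individuals
eigvals, eigvecs = np.linalg.eigh(C)
V2 = eigvecs[:, -2:]                  # r = 2 principal axes (n_snps x 2)

projections = G @ V2                  # one 2-D point per individual
print(projections[:5])
```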

[Figure: scatter plot of the projections of the genotyped individuals on the first two principal axes.]

Each individual is denoted by a colored two-letter label indicating their country of origin. It can be seen that the projections reflect the geography of Europe well.