Multi-View Regression via Canonical Correlation Analysis

Multi-View Regression via Canonical Correlation Analysis
Dean P. Foster, U. Pennsylvania (Sham M. Kakade of TTI)

Model and background

$$(\forall\, i \le n)\qquad y_i = \sum_{j=1}^{p} X_{ij}\,\beta_j + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$$

Data mining and machine learning: $p \gg n$.

Model and background

Can't fit the model if $p \gg n$. Trick: assume most $\beta_j$ are in fact zero.

Variable selection (RIC):
$$\hat\beta_i^{RIC} = \begin{cases} 0 & \text{if } |\hat\beta_i| / SE_i \le \sqrt{2\log p} \\ \hat\beta_i & \text{otherwise} \end{cases}$$

Basically just stepwise regression and Bonferroni. Can be justified by risk ratios (Donoho and Johnstone 94, Foster and George 94).
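
To make the RIC rule concrete, here is a minimal numpy sketch (mine, not from the slides) of hard-thresholding the t-statistics at $\sqrt{2\log p}$; the function and variable names are illustrative only.

```python
import numpy as np

def ric_select(beta_hat, se, p):
    """Hard-threshold estimated coefficients at the RIC/Bonferroni cutoff.

    Keep beta_hat[i] only if its |t-statistic| exceeds sqrt(2 * log p);
    otherwise set it to zero (a sketch of the rule on the slide).
    """
    t_stats = np.abs(beta_hat) / se
    cutoff = np.sqrt(2.0 * np.log(p))
    return np.where(t_stats > cutoff, beta_hat, 0.0)

# Toy usage: 1000 candidate coefficients, most of them pure noise.
rng = np.random.default_rng(0)
p = 1000
beta_hat = rng.normal(0.0, 1.0, size=p)   # noisy estimates
beta_hat[:5] += 8.0                        # a few genuinely large signals
se = np.ones(p)
kept = ric_select(beta_hat, se, p)
print("selected:", np.count_nonzero(kept))
```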

Model and background

I've played with lots of alternatives:
- FDR instead of RIC: $\sqrt{2\log p} \to \sqrt{2\log(p/q)}$
- empirical Bayes (George and Foster, 2000)
- Cauchy prior (Foster and Stine, 200x)
- regression → logistic regression
- IID / independence assumptions → block independence (with Dongyu Lin)

Model and background

Where do this many variables come from?
- Missing value codes
- Interactions
- Transformations

Example (with Bob): Personal Bankruptcy
- 350 basic variables
- all interactions, missing value codes, etc. lead to 67,000 variables
- about 1 million clustered cases
- Ran stepwise logistic regression using FDR
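
As a rough illustration (not from the slides) of how a few hundred base variables explode into tens of thousands, the sketch below builds missing-value indicator codes and all pairwise interactions; the helper name and the toy data are hypothetical.

```python
import numpy as np
from itertools import combinations

def expand_features(X):
    """Expand a base design matrix with missing-value codes and pairwise interactions.

    X: (n, k) array that may contain NaNs for missing entries.
    Returns an (n, 2k + k*(k-1)/2) array: mean-imputed originals,
    missing-value indicators, and all pairwise products.
    """
    n, k = X.shape
    missing = np.isnan(X).astype(float)                 # missing value codes
    X_filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    interactions = [X_filled[:, i] * X_filled[:, j]
                    for i, j in combinations(range(k), 2)]
    return np.column_stack([X_filled, missing] + interactions)

# With k = 350 base variables this already gives 350 + 350 + 61,075 columns,
# the same order of magnitude as the ~67,000 on the slide.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
X[rng.random(X.shape) < 0.1] = np.nan                   # sprinkle in missing values
print(expand_features(X).shape)
```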

Model and background

Summary of the current state of the art:
- We can generate many non-linear X's
- We can select the good ones from large lists

Isn't the problem solved?

Model and background

There is always room for finding new X's.

New features

Current methods of finding X's are non-linear. Can we find new linear combinations of existing X's?
- Hope: use linear theory
- Hope: fast CPUs
- Hope: new theory

Semi-supervised

Semi-supervised learning:
- Y's are expensive, X's are cheap
- We get n rows of Y
- But also m free rows of just X's

Can this help?

Usual data table for data mining

$Y$: $n \times 1$, $X$: $n \times p$, with $p \gg n$.

With unlabeled data

$m$ rows of unlabeled data: $Y$: $n \times 1$, $X$: $(n+m) \times p$.

With alternative X's

$m$ rows of unlabeled data, and two sets of equally useful X's:
$Y$: $n \times 1$, $X$: $(n+m) \times p$, $Z$: $(n+m) \times p$, with $m \gg n$.

Examples

- Person identification: Y = identity, X = profile photo, Z = front photo
- Topic identification (Medline): Y = topic, X = abstract, Z = text
- The web: Y = classification, X = content (i.e. words), Z = hyper-links

We will call these the multi-view setup.

A Multi-View Assumption

Define
$$\sigma_X^2 = E[Y - E(Y \mid X)]^2, \qquad \sigma_Z^2 = E[Y - E(Y \mid Z)]^2, \qquad \sigma_{X,Z}^2 = E[Y - E(Y \mid X, Z)]^2$$
(We will take conditional expectations to be linear.)

Assumption: $Y$, $X$, and $Z$ satisfy the $\alpha$-multi-view assumption if
$$\sigma_X^2 \le \sigma_{X,Z}^2 (1 + \alpha), \qquad \sigma_Z^2 \le \sigma_{X,Z}^2 (1 + \alpha)$$

In other words, $\sigma_X^2 \approx \sigma_Z^2 \approx \sigma_{X,Z}^2$: views X and Z are redundant (i.e. highly collinear).
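
A small simulation (mine, for illustration) of a redundant two-view model, checking numerically that $\sigma_X^2$, $\sigma_Z^2$ and $\sigma_{X,Z}^2$ are nearly equal, i.e. that $\alpha$ is small; the generative model here is an assumption chosen only to make the point.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
u = rng.normal(size=n)                     # shared latent signal
x = u + 0.1 * rng.normal(size=n)           # view 1: noisy copy of u
z = u + 0.1 * rng.normal(size=n)           # view 2: another noisy copy
y = u + rng.normal(size=n)                 # response driven by u

def residual_var(y, *views):
    """Residual variance of the best linear predictor of y from the given views."""
    A = np.column_stack([np.ones_like(y)] + list(views))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ coef) ** 2)

s2_x  = residual_var(y, x)        # sigma_X^2
s2_z  = residual_var(y, z)        # sigma_Z^2
s2_xz = residual_var(y, x, z)     # sigma_{X,Z}^2
print(s2_x, s2_z, s2_xz)          # all close => alpha is small
```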

The Multi-View Assumption in the Linear Case

- The views are redundant; satisfied if each view predicts Y well.
- No conditional independence assumptions (i.e. Bayes nets).
- No coordinate, norm, eigenvalue, or dimensionality assumptions.

Both estimators are similar

Lemma. Under the $\alpha$-multi-view assumption,
$$E\big[(E(Y \mid X) - E(Y \mid Z))^2\big] \le 2\alpha\sigma^2$$

Idea: find directions in X and Z that are highly correlated. CCA solves this problem already!

What if we run CCA on X and Z?

- CCA = canonical correlation analysis
- Find the directions that are most highly correlated
- Very close to PCA (principal components analysis)
- Generates coordinates for the data
- End up with canonical coordinates for both the X's and the Z's
- Numerically an eigenvalue problem

CCA form

Running CCA on X and Z generates a coordinate space for both of them. We will call this CCA form.

Definition. $X_i$ and $Z_j$ are in CCA form if:
- the $X_i$ are orthonormal
- the $Z_i$ are orthonormal
- $X_i^T Z_j = 0$ for $i \ne j$
- $X_i^T Z_i = \lambda_i$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$

(This is the output of running CCA on the original X's and Z's.)

CCA form as a covariance matrix

$$\Sigma = \begin{bmatrix} \Sigma_{XX} & \Sigma_{XZ} \\ \Sigma_{ZX} & \Sigma_{ZZ} \end{bmatrix} = \begin{bmatrix} I & D \\ D & I \end{bmatrix}$$

The canonical correlations are the $\lambda_i$:
$$D = \begin{bmatrix} \lambda_1 & 0 & 0 & \cdots \\ 0 & \lambda_2 & 0 & \cdots \\ 0 & 0 & \lambda_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$
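
One standard way to compute CCA form is to whiten each view and take an SVD of the cross-covariance; the sketch below (not the authors' code) does exactly that and checks that the transformed cross-covariance is diagonal with the canonical correlations on the diagonal.

```python
import numpy as np

def cca_form(X, Z, eps=1e-8):
    """Put two centered views into CCA form via whitening + SVD.

    Returns (X_cca, Z_cca, lams): columns of X_cca are orthonormal in the
    empirical inner product, likewise Z_cca, cross products are diagonal,
    and lams are the canonical correlations lambda_1 >= lambda_2 >= ...
    """
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Z = Z - Z.mean(axis=0)

    def inv_sqrt(C):
        # symmetric inverse square root, with a small ridge for stability
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w + eps)) @ V.T

    Wx = inv_sqrt(X.T @ X / n)
    Wz = inv_sqrt(Z.T @ Z / n)
    U, lams, Vt = np.linalg.svd(Wx @ (X.T @ Z / n) @ Wz)
    X_cca = X @ Wx @ U
    Z_cca = Z @ Wz @ Vt.T
    return X_cca, Z_cca, lams

# Sanity check: cross-covariance of the transformed views is diag(lambda).
rng = np.random.default_rng(3)
U0 = rng.normal(size=(500, 3))
X = np.column_stack([U0, rng.normal(size=(500, 2))])
Z = np.column_stack([U0 + 0.2 * rng.normal(size=(500, 3)), rng.normal(size=(500, 2))])
X_cca, Z_cca, lams = cca_form(X, Z)
print(np.round(X_cca.T @ Z_cca / 500, 2))   # approximately diag(lams)
print(np.round(lams, 2))
```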

The Main Result

Theorem. Let $\hat\beta$ be the ridge regression estimator with weights induced by the CCA. Then
$$\text{Risk}(\hat\beta) \le \left(5\alpha + \frac{\sum_i \lambda_i^2}{n}\right)\sigma^2$$

The Main Result

CCA-ridge regression minimizes least squares plus a penalty of
$$\sum_i \frac{1 - \lambda_i}{\lambda_i}\,\beta_i^2$$

- Large penalties in the less correlated directions
- The $\lambda_i$'s are the correlations
- A shrinkage estimator
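
In CCA form the penalized problem has a simple closed form; the sketch below (mine, with an assumed overall penalty scale of $n$, which the slides do not specify) solves least squares plus the per-coordinate penalty directly.

```python
import numpy as np

def cca_ridge(X_cca, y, lams, scale=None, eps=1e-8):
    """Least squares on the CCA coordinates of X with a per-coordinate
    penalty proportional to (1 - lambda_i) / lambda_i.

    X_cca: (n, p) labeled rows of view X in CCA form, y: (n,) responses,
    lams: canonical correlations. `scale` sets the overall penalty strength
    (defaulted to n here; the slides' exact scaling is not reproduced).
    """
    n, p = X_cca.shape
    if scale is None:
        scale = n
    penalty = scale * (1.0 - lams) / np.maximum(lams, eps)
    A = X_cca.T @ X_cca + np.diag(penalty)
    return np.linalg.solve(A, X_cca.T @ y)

# If X_cca.T @ X_cca = n * I (CCA form) and scale = n, this reduces
# coordinate-wise to beta_i = lambda_i * (OLS estimate in direction i):
# heavy shrinkage where the views are weakly correlated, little where
# lambda_i is close to 1.
```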

The Main Result

Recall $\alpha$ is the multi-view property:
$$\sigma_X^2 \le \sigma_{X,Z}^2 (1 + \alpha), \qquad \sigma_Z^2 \le \sigma_{X,Z}^2 (1 + \alpha)$$

The Main Result

In the bound, $5\alpha$ is the bias and $\frac{\sum_i \lambda_i^2}{n}$ is the variance.

Doesn't fit my style

I like feature selection! On to Theorem 2.

Alternative version

Theorem. For $\hat\beta$ the CCA-testimator:
$$\text{Risk}(\hat\beta) \le \left(2\alpha + \frac{d}{n}\right)\sigma^2$$
where $d$ is the number of $\lambda_i$ for which $\lambda_i \ge 1 - \alpha$.

Alternative version

The CCA testimator:
$$\hat\beta_i = \begin{cases} \text{MLE}(\beta_i) & \text{if } \lambda_i \ge 1 - \alpha \\ 0 & \text{otherwise} \end{cases}$$
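
A minimal sketch (illustrative, not the paper's code) of the testimator: keep the per-direction OLS/MLE estimate only when $\lambda_i \ge 1 - \alpha$, and zero it otherwise.

```python
import numpy as np

def cca_testimator(X_cca, y, lams, alpha):
    """Keep the OLS coefficient in canonical direction i only if lambda_i >= 1 - alpha.

    Assumes X_cca is in CCA form on the labeled rows, so X_cca.T @ X_cca ~ n * I
    and the per-direction OLS estimate is X_cca[:, i] @ y / n.
    """
    n = X_cca.shape[0]
    ols = X_cca.T @ y / n                  # coordinate-wise OLS (MLE) estimates
    keep = lams >= 1.0 - alpha             # highly correlated directions only
    return np.where(keep, ols, 0.0)

# d = number of kept directions drives the variance term d/n in the bound.
```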

Alternative version

Do we need to know $\alpha$?
- We can try features in order
- Use a promiscuous rule to add variables (i.e. AIC)
- Will do as well as the theorem, and possibly much better
- Doesn't mix all that well with stepwise regression

Conclusions

- Trade-off between the two theorems?
- Experimental work? Soon!