CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu


Feature engineering is hard
1. Extract informative features from domain knowledge
2. Choose a suitable feature representation: discrete or continuous, log scale or linear scale
3. Interactions between features: pairwise constraints, high-order complex patterns, combinatorial explosion
How do we know which features are useful? Features are highly correlated. How do we combine features?

Deep learning is feature learning
Key idea:
1. Learning complex and meaningful features from raw data
2. Using a layer-wise structure to extract features: higher-level features are derived from lower-level features
3. End-to-end learning

Neural Logistic Regression
$P(y \mid x) = \dfrac{\exp(\theta_y^T h(x))}{\sum_{c \in \text{classes}} \exp(\theta_c^T h(x))}$, where $h_i(x) = \mathrm{sigmoid}(w_i^T \phi(x))$ and $\phi(x)$ is the input feature representation.
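
As a minimal illustration of the two equations above, the sketch below computes the hidden sigmoid features and the softmax class probabilities with NumPy, treating $\phi(x)$ as the given input vector. The weight matrices `W` and `Theta` and the dimensions are made-up placeholders, not values from the lecture.

```python
import numpy as np

def neural_logistic_forward(x, W, Theta):
    """Forward pass: sigmoid hidden layer followed by softmax output.

    x:     (d,)   input feature vector phi(x)
    W:     (m, d) hidden-layer weights, one row w_i per hidden unit
    Theta: (C, m) class weights, one row theta_c per class
    """
    h = 1.0 / (1.0 + np.exp(-W @ x))           # h_i(x) = sigmoid(w_i^T phi(x))
    scores = Theta @ h                          # theta_c^T h(x) for every class c
    scores -= scores.max()                      # stabilize the exponentials
    p = np.exp(scores) / np.exp(scores).sum()   # softmax over classes
    return p

# Hypothetical sizes: 10 input features, 5 hidden units, 3 classes.
rng = np.random.default_rng(0)
p = neural_logistic_forward(rng.normal(size=10),
                            rng.normal(size=(5, 10)),
                            rng.normal(size=(3, 5)))
print(p, p.sum())  # class probabilities, summing to 1
```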

Conditional Neural Fields (figure: chain-structured graphical model over labels $y(i-1)$, $y(i)$, $y(i+1)$)

Matrix Data
Matrix data: gene expression
Dimensionality reduction and feature selection
Low-rank approximation

High-dimensional data (figure: gene-by-individual expression matrix, with p = number of genes and n = number of individuals)

Overfitting (figure: disease-labeled data matrix, with p = number of parameters and n = number of data points)

Component analysis
How do we understand the main signals in the data?
Key assumptions:
1. High-dimensional data lies on a lower-dimensional space (a.k.a. the low-rank assumption)
2. Projections onto the lower-dimensional space describe the major properties of the data

Matrix decomposition
Matrix factorization and low-rank approximation: $X \approx WH$
Singular value decomposition (SVD): $X = U \Sigma V^T$
Other generalizations: nonnegative matrix factorization, sparse matrix factorization

Component Analysis: Tissue-specific gene expression (figure: expression profiles for blood, brain, and liver)

Deconvolution: Gene expression profile
Expression (known) = Signature (known) × Mixture (unknown)

Deconvolution: Gene expression profile
Expression (known) = Signature (unknown) × Mixture (known)

Deconvolution: Gene expression profile
Expression (known) = Signature (unknown) × Mixture (unknown)

Deconvolution: Algorithms
Expression (known) = Signature (known) × Mixture (unknown): linear regression
Expression (known) = Signature (unknown) × Mixture (known): linear regression
Expression (known) = Signature (unknown) × Mixture (unknown): matrix factorization
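
As a sketch of the first case (known signature, unknown mixture), each sample's expression profile can be regressed on the signature matrix; the non-negative least-squares call below additionally keeps the estimated proportions non-negative, which is a common but here assumed modeling choice. The matrix shapes and variable names are illustrative, not values from the lecture.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical sizes: 200 genes, 3 cell types (e.g. blood/brain/liver), 10 samples.
rng = np.random.default_rng(0)
signature = rng.random((200, 3))                 # known: genes x cell types
true_mix = rng.dirichlet(np.ones(3), size=10).T  # cell types x samples
expression = signature @ true_mix                # known: genes x samples

# Solve Expression ~ Signature x Mixture one sample (column) at a time.
mixture = np.column_stack([nnls(signature, expression[:, j])[0]
                           for j in range(expression.shape[1])])
print(np.abs(mixture - true_mix).max())  # near 0 on this noiseless toy data
```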

Notation
number of observations: $n$
number of dimensions: $d$
data matrix: $X \in \mathbb{R}^{n \times d}$
i-th data point: $x_i = (X_{i,1}, X_{i,2}, \ldots, X_{i,d}) \in \mathbb{R}^{1 \times d}$
response vector: $y \in \mathbb{R}^n$

Linear Regression
$y_i = x_i^T \beta + \epsilon_i = \sum_j \beta_j X_{i,j} + \epsilon_i$
Fitting error: $\epsilon_i$

Linear Regression
Assumption: the errors are Gaussian noise
$y = X\beta + \epsilon$
$\hat{\beta} = \arg\min_\beta \sum_i \Big(y_i - \sum_j \beta_j X_{i,j}\Big)^2$

Linear Regression

Linear Regression
$\hat{\beta} = \arg\min_\beta \sum_i \Big(y_i - \sum_j \beta_j X_{i,j}\Big)^2 = \arg\min_\beta (y - X\beta)^T (y - X\beta)$
$\hat{\beta} = (X^T X)^{-1} X^T y$
Question: How do we derive the closed-form solution?
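
Filling in the standard derivation the question asks for (assuming $X^T X$ is invertible): setting the gradient of the objective to zero,
$\nabla_\beta\, (y - X\beta)^T (y - X\beta) = -2 X^T (y - X\beta) = 0 \;\Longrightarrow\; X^T X \hat{\beta} = X^T y \;\Longrightarrow\; \hat{\beta} = (X^T X)^{-1} X^T y.$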

Underfitting and Overfitting
Too simple vs. too complex
Control the complexity (degrees of freedom) of the model:
1. Add constraints to the model
2. Add regularization

Linear regression with constraints: e.g., the simplex constraint ($\forall j,\ \beta_j \ge 0$ and $\sum_j \beta_j = 1$)

Linear constraints
Equality constraints: $Q\beta = b$, where $Q \in \mathbb{R}^{m \times d}$ and $b \in \mathbb{R}^m$. Example: sum-to-one constraint $\sum_j \beta_j = 1$.
Inequality constraints: $R\beta \le c$, where $R \in \mathbb{R}^{l \times d}$ and $c \in \mathbb{R}^l$. Example: nonnegativity constraint $\forall j,\ \beta_j \ge 0$.

Norm constraints
Norm: a function that assigns a positive length or size to each non-zero vector in a vector space
L1 norm: $\|\beta\|_1 = \sum_j |\beta_j|$
L2 norm: $\|\beta\|_2 = \big(\sum_j \beta_j^2\big)^{1/2}$
Lp norm: $\|\beta\|_p = \big(\sum_j |\beta_j|^p\big)^{1/p}$

Regularization: norm constraints
$\sum_j |\beta_j| \le t$ and $\sum_j \beta_j^2 \le t^2$

Lagrange multipliers
$\min_x f(x)$ subject to $g(x) = 0$, $h(x) \le t$
Remove the constraints: $\min_{x,\lambda,\mu} L(x, \lambda, \mu) = f(x) + \lambda^T g(x) + \mu^T (h(x) - t)$
KKT conditions: $\nabla_x L(x, \lambda, \mu) = 0$, $\mu \ge 0$, $\forall i,\ \mu_i (h(x)_i - t_i) = 0$

Lagrange multipliers
Problem 1: $\min_\beta \|y - X\beta\|_2^2$ subject to $\|\beta\|_2^2 \le t$
Problem 2: $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
TODO: show that solving these two problems is equivalent.
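
A brief sketch of the connection, using the Lagrangian from the previous slide: for the constrained Problem 1,
$L(\beta, \lambda) = \|y - X\beta\|_2^2 + \lambda\big(\|\beta\|_2^2 - t\big),$
and for any fixed $\lambda \ge 0$, minimizing $L$ over $\beta$ has the same minimizer as Problem 2 (the term $-\lambda t$ does not depend on $\beta$); the KKT conditions then determine which $\lambda$ in Problem 2 corresponds to a given $t$ in Problem 1.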

L2 regularization
$\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
Setting the derivative of the objective to zero gives $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$.
Comments:
1. Introducing the regularization makes the regression numerically more robust
2. The regularization term controls the size of the solution
3. In practice, the regularization coefficient $\lambda$ is chosen by cross-validation
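
A minimal NumPy sketch of the closed-form ridge solution above; the synthetic data and the regularization value are arbitrary illustrations, not values from the lecture.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return (X^T X + lam * I)^{-1} X^T y without forming an explicit inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = rng.normal(size=10)
y = X @ beta_true + 0.1 * rng.normal(size=50)

beta_hat = ridge_closed_form(X, y, lam=1.0)
print(np.round(beta_hat - beta_true, 2))  # small errors; shrinkage grows with lam
```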

Probabilistic (Bayesian) interpretation
Suppose we have a Gaussian prior $\beta \sim N(0, \lambda^{-1} I)$.
Since we assume the error term is also Gaussian, the data likelihood can be written as $L(\beta) \propto \exp(-\|y - X\beta\|_2^2)$.
The posterior mean is $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$.
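
A short sketch of where this comes from, written with explicit constants (the slide absorbs the noise variance into $\lambda$): the negative log posterior is
$-\log p(\beta \mid y, X) = \tfrac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \tfrac{\lambda}{2}\|\beta\|_2^2 + \text{const},$
which is a ridge objective; since the posterior is Gaussian, its mean equals its mode, giving $\hat{\beta} = (X^T X + \sigma^2 \lambda I)^{-1} X^T y$.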

Sparse regularization
$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_0$
where the L0 norm is the number of non-zeros in the vector
Combinatorial explosion: we need to enumerate all possible subsets

Sparse regularization
$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_0$
where the L0 norm is the number of non-zeros in the vector
Combinatorial explosion: we need to enumerate all possible subsets
Solution: relax to the L1 norm
$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
Comments:
1. It allows efficient feature selection
2. The objective function is non-smooth
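
The L1 problem above (the LASSO) has no closed form. As one illustrative solver, the short proximal-gradient (ISTA) loop below alternates a gradient step on the smooth term with soft-thresholding; this is a sketch with made-up data and step-size choices, and it is a different technique from the sub-gradient approach referenced in the TODO slide that follows.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iters=500):
    """Proximal gradient (ISTA) for min_b ||y - Xb||_2^2 + lam * ||b||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1/L for the gradient of ||y - Xb||^2
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ beta - y)            # gradient of the smooth part
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20); beta_true[:3] = [2.0, -1.5, 1.0]   # sparse ground truth
y = X @ beta_true + 0.1 * rng.normal(size=100)

print(np.round(lasso_ista(X, y, lam=5.0), 2))  # most coefficients shrink to exactly 0
```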

TODO: sub-gradient and LASSO

TODO: Convex optimization

Convex Optimization
$\min_x f(x)$ subject to $g(x) = 0$, $h(x) \le t$
The above problem is convex if:
1. the objective function is convex
2. the feasible set defined by the constraints is convex
A local optimum is a global optimum.

Convex Optimization Solvers

Matrix factorization
Matrix factorization and low-rank approximation: $X \approx WH$
Variants: nonnegative matrix factorization, sparse matrix factorization

Matrix Norms
Frobenius norm: $\|A\|_F = \big(\sum_{i,j} A_{i,j}^2\big)^{1/2} = \mathrm{trace}(A^T A)^{1/2}$ (the "L2 norm" of matrices)
Nuclear norm: $\|A\|_* = \mathrm{trace}\big(\sqrt{A^T A}\big) = \sum_k \sigma_k$, the sum of singular values (the "L1 norm" of matrices)
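
A quick NumPy check of the two definitions above on a random matrix; the matrix itself is an arbitrary illustration.

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))
sing_vals = np.linalg.svd(A, compute_uv=False)

frob = np.sqrt((A ** 2).sum())     # (sum_ij A_ij^2)^{1/2}
nuclear = sing_vals.sum()          # sum of singular values

# Compare against the built-ins and the trace identity:
print(np.isclose(frob, np.linalg.norm(A, 'fro')),
      np.isclose(frob, np.sqrt(np.trace(A.T @ A))),
      np.isclose(nuclear, np.linalg.norm(A, 'nuc')))
```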

Matrix factorization
Matrix factorization and low-rank approximation:
$X \in \mathbb{R}^{n \times m}$, $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times m}$
We want to find the optimal $W$ and $H$: $\min_{W,H} \|X - WH\|_F^2$

How do we solve this optimization problem? $\min_{W,H} \|X - WH\|_F^2$

Regularized Matrix Factorization
$\min_{W,H} \|X - WH\|_F^2$
We assume the ranks of $W$ and $H$ are no larger than $r$, with $W \in \mathbb{R}^{n \times r}$ and $H \in \mathbb{R}^{r \times m}$:
$\min_{W,H} \|X - WH\|_F^2 + \lambda \big(\|W\|_F^2 + \|H\|_F^2\big)$
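
One common way to attack this non-convex problem is alternating minimization: fix $H$ and solve a ridge problem for $W$, then fix $W$ and solve for $H$. The sketch below is a minimal illustration of that idea with made-up sizes and regularization; it is not an algorithm stated in the lecture.

```python
import numpy as np

def regularized_als(X, r, lam, n_iters=100, seed=0):
    """Alternating least squares for min ||X - WH||_F^2 + lam * (||W||_F^2 + ||H||_F^2)."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n, r))
    H = rng.normal(scale=0.1, size=(r, m))
    I = np.eye(r)
    for _ in range(n_iters):
        # Fix H, solve the ridge problem for W:  W = X H^T (H H^T + lam I)^{-1}
        W = np.linalg.solve(H @ H.T + lam * I, H @ X.T).T
        # Fix W, solve the ridge problem for H:  H = (W^T W + lam I)^{-1} W^T X
        H = np.linalg.solve(W.T @ W + lam * I, W.T @ X)
    return W, H

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 20))   # rank-4 matrix
W, H = regularized_als(X, r=4, lam=0.1)
print(np.linalg.norm(X - W @ H, 'fro') / np.linalg.norm(X, 'fro'))  # small relative error
```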

Regularized matrix approximation
We want to find a low-rank approximation with an implicit regularization:
$\min_Y \|X - Y\|_F^2 + \lambda \|Y\|_*$ (nuclear-norm, i.e. "L1", regularization)
We can also solve the explicit low-rank problem
$\min_{W,H} \|X - WH\|_F^2 + \tfrac{\lambda}{2} \big(\|W\|_F^2 + \|H\|_F^2\big)$ ("L2" regularization + low rank)
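
For reference, the implicit (nuclear-norm) problem has a well-known closed-form solution via soft-thresholding of singular values; stated here with a $\tfrac{1}{2}$ in front of the quadratic term, so the constant differs from the slide's objective by a factor of 2: with the SVD $X = U \Sigma V^T$,
$\arg\min_Y \tfrac{1}{2}\|X - Y\|_F^2 + \lambda \|Y\|_* = U\, \mathcal{S}_\lambda(\Sigma)\, V^T, \qquad \mathcal{S}_\lambda(\Sigma)_{kk} = \max(\sigma_k - \lambda, 0),$
which is why the nuclear-norm penalty produces a low-rank solution.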

Nonnegative Matrix Factorization
$X \in \mathbb{R}^{n \times m}$, $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times m}$
We want to find the optimal $W$ and $H$ such that $X \approx WH$,
subject to $\forall k, j:\ H_{k,j} \ge 0$ and $\forall i, k:\ W_{i,k} \ge 0$

Nonnegative Matrix Factorization
Minimize the Frobenius norm: $\min_{W,H} \|X - WH\|_F^2$
Minimize the KL divergence: $\min_{W,H} D(X \| WH)$, where
$D(A \| B) = \sum_{i,j} \Big(A_{i,j} \log \tfrac{A_{i,j}}{B_{i,j}} - A_{i,j} + B_{i,j}\Big)$

Coordinate optimization
Optimize one variable while fixing the values of all other variables, where $a$ and $r$ are a column vector and a row vector of $X$ (figure)

Coordinate optimization
Gradient calculation (scaled gradient of $\|X - WH\|_F^2$):
$g(h_{ij}) = h_{ij} \dfrac{(W^T X)_{ij}}{(W^T W H)_{ij}} - h_{ij}$, $\quad g(w_{ij}) = w_{ij} \dfrac{(X H^T)_{ij}}{(W H H^T)_{ij}} - w_{ij}$
A gradient descent step $w_{ij}^{new} = w_{ij} + g(w_{ij})$, $h_{ij}^{new} = h_{ij} + g(h_{ij})$ may violate the nonnegativity constraint. (figure)

Coordinate optimization
Projected gradient descent:
$w_{ij}^{new} = \max(0,\ w_{ij} + g(w_{ij}))$, $\quad h_{ij}^{new} = \max(0,\ h_{ij} + g(h_{ij}))$
Multiplicative updates:
$w_{ij}^{new} \leftarrow w_{ij} \dfrac{(X H^T)_{ij}}{(W H H^T)_{ij}}$, $\quad h_{ij}^{new} \leftarrow h_{ij} \dfrac{(W^T X)_{ij}}{(W^T W H)_{ij}}$
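
A compact NumPy sketch of the multiplicative updates above for the Frobenius objective; the data, rank, and iteration count are illustrative assumptions, and a small epsilon is added to the denominators to avoid division by zero.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iters=200, seed=0, eps=1e-10):
    """Multiplicative updates for min ||X - WH||_F^2 with W, H >= 0."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # h_ij <- h_ij (W^T X)_ij / (W^T W H)_ij
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # w_ij <- w_ij (X H^T)_ij / (W H H^T)_ij
    return W, H

# Toy nonnegative data with rank-5 structure.
rng = np.random.default_rng(1)
X = rng.random((40, 5)) @ rng.random((5, 25))
W, H = nmf_multiplicative(X, r=5)
print(np.linalg.norm(X - W @ H, 'fro') / np.linalg.norm(X, 'fro'))  # small relative error
```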

TODO: derive the multiplicative update rules for KL divergence

Netflix Challenge