CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu
Feature engineering is hard
1. Extract informative features from domain knowledge
2. Suitable feature representation: discrete or continuous? log scale or linear scale?
3. Interactions between features: pairwise constraints, high-order complex patterns, combinatorial explosion
How do we know which features are useful? Features are highly correlated. How to combine features?
Deep learning is feature learning
Key idea:
1. Learning complex and meaningful features from raw data
2. Using a layer-wise structure to extract features: higher-level features are derived from lower-level features
3. End-to-end learning
Neural Logistic Regression
$P(y \mid x) = \dfrac{\exp(\theta_y^T h)}{\sum_{c \in \text{classes}} \exp(\theta_c^T h)}$
where the hidden features are $h_i = \mathrm{sigmoid}(w_i^T \phi(x))$
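Below is a minimal numpy sketch of the neural logistic regression above: one sigmoid hidden layer followed by a softmax over classes. All sizes, variable names, and the random weights are hypothetical and for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_logistic_predict(phi_x, W, Theta):
    """One sigmoid hidden layer followed by a softmax output.
    phi_x : (d,)   raw feature vector phi(x)
    W     : (k, d) hidden-unit weights, one row per hidden unit w_i
    Theta : (C, k) class weights, one row per class theta_c
    Returns a length-C vector of class probabilities P(y = c | x)."""
    h = sigmoid(W @ phi_x)        # hidden features h_i = sigmoid(w_i^T phi(x))
    scores = Theta @ h            # theta_c^T h for every class c
    scores -= scores.max()        # stabilize the softmax numerically
    p = np.exp(scores)
    return p / p.sum()

# toy usage with random weights and hypothetical sizes
rng = np.random.default_rng(0)
phi_x = rng.normal(size=10)
W = rng.normal(size=(5, 10))
Theta = rng.normal(size=(3, 5))
print(neural_logistic_predict(phi_x, W, Theta))
```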
Conditional Neural Fields
[figure: chain-structured graphical model with label nodes $y_{i-1}$, $y_i$, $y_{i+1}$]
Matrix Data
Matrix data: gene expression
Dimensionality reduction and feature selection
Low-rank approximation
High-dimensional data
[figure: gene expression data matrix; $p$ = number of genes, $n$ = number of individuals]
Overfitting
[figure: data table with a "disease" response column; $p$ = number of parameters, $n$ = number of data points]
Component analysis
How do we understand the main signals in the data?
Key assumptions:
1. High-dimensional data lies on a lower-dimensional space (a.k.a. the low-rank assumption)
2. Projections onto the lower-dimensional space describe the major properties of the data
Matrix decomposition
Matrix factorization and low-rank approximation
Singular value decomposition (SVD): $X = U \Sigma V^T$
Other generalizations: nonnegative matrix factorization, sparse matrix factorization
Component Analysis: Tissue-specific gene expression
[figure: expression components for tissues such as Blood, Brain, Liver]
Deconvolution: Gene expression profile
Expression (known) = Signature (known) $\times$ Mixture (unknown)
Deconvolution: Gene expression profile
Expression (known) = Signature (unknown) $\times$ Mixture (known)
Deconvolution: Gene expression profile
Expression (known) = Signature (unknown) $\times$ Mixture (unknown)
Deconvolution: Algorithms
Expression (known) = Signature (known) $\times$ Mixture (unknown): linear regression
Expression (known) = Signature (unknown) $\times$ Mixture (known): linear regression
Expression (known) = Signature (unknown) $\times$ Mixture (unknown): matrix factorization
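A minimal sketch of the first scenario (known signatures, unknown mixture), solved as ordinary least squares with numpy. The matrix names `S`, `E`, `M` and all dimensions are hypothetical; practical deconvolution methods typically also impose nonnegativity or sum-to-one constraints on the mixture.

```python
import numpy as np

# S (genes x cell types) is the known signature matrix, E (genes x samples) the
# observed expression; we recover the mixture M (cell types x samples) by least squares.
rng = np.random.default_rng(0)
n_genes, n_types, n_samples = 200, 4, 10
S = rng.random((n_genes, n_types))                       # known signatures
M_true = rng.random((n_types, n_samples))
E = S @ M_true + 0.01 * rng.normal(size=(n_genes, n_samples))

M_hat, *_ = np.linalg.lstsq(S, E, rcond=None)            # solves min_M ||E - S M||_F^2
print(np.abs(M_hat - M_true).max())
```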
Notations
number of observations: $n$
number of dimensions: $d$
data matrix: $X \in \mathbb{R}^{n \times d}$
$i$-th data point: $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d}) \in \mathbb{R}^{1 \times d}$
response vector: $y \in \mathbb{R}^n$
Linear Regression
$y_i = x_i^T \beta + \epsilon_i = \sum_j \beta_j x_{i,j} + \epsilon_i$
Fitting error: $\epsilon_i$
Linear Regression
Assumption: the errors are Gaussian noise
$y = X\beta + \epsilon$
$\hat{\beta} = \arg\min_\beta \sum_i \left(y_i - \sum_j \beta_j x_{i,j}\right)^2$
Linear Regression
Linear Regression
$\hat{\beta} = \arg\min_\beta \sum_i \left(y_i - \sum_j \beta_j x_{i,j}\right)^2 = \arg\min_\beta (y - X\beta)^T (y - X\beta)$
$\hat{\beta} = (X^T X)^{-1} X^T y$
Question: how to derive the closed-form solution?
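A quick numerical check of the closed form on synthetic data, assuming numpy is available; the normal equations are solved with `np.linalg.solve` rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Ordinary least squares via the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
print(beta_true)
```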
Underfitting and Overfitting
Too simple vs. too complex
Control the complexity (degrees of freedom) of the model:
1. Add constraints to the model
2. Add regularization
Linear regression with constraints
Example: the simplex constraint $\sum_j \beta_j = 1$, $\forall j\ \beta_j \ge 0$
Linear constraints
Equality constraints: $Q\beta = b$, where $Q \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$
Example: sum-to-one constraint $\sum_j \beta_j = 1$
Inequality constraints: $R\beta \le c$, where $R \in \mathbb{R}^{l \times d}$, $c \in \mathbb{R}^l$
Example: nonnegative constraint $\forall j,\ \beta_j \ge 0$
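One way to solve such constrained regressions in practice is a generic convex solver. The sketch below uses the cvxpy package (an assumption, not part of the lecture) for the simplex-constrained least-squares problem; the data and sizes are synthetic.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.normal(size=(n, d))
y = X @ np.array([0.1, 0.2, 0.3, 0.4]) + 0.01 * rng.normal(size=n)

beta = cp.Variable(d)
objective = cp.Minimize(cp.sum_squares(X @ beta - y))
constraints = [cp.sum(beta) == 1, beta >= 0]   # sum-to-one and nonnegativity
cp.Problem(objective, constraints).solve()
print(beta.value)
```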
Norm constraints
Norm: a function that assigns a positive length or size to each non-zero vector in a vector space
L1 norm: $\|\beta\|_1 = \sum_j |\beta_j|$
L2 norm: $\|\beta\|_2 = \left(\sum_j \beta_j^2\right)^{1/2}$
Lp norm: $\|\beta\|_p = \left(\sum_j |\beta_j|^p\right)^{1/p}$
Regularization: norm constraints
$\sum_j |\beta_j| \le t$
$\sum_j \beta_j^2 \le t^2$
Lagrange multipliers
$\min_x f(x)$ subject to $g(x) = 0$, $h(x) \le t$
Remove the constraints via the Lagrangian:
$L(x, \lambda, \mu) = f(x) + \lambda^T g(x) + \mu^T (h(x) - t)$
KKT conditions:
$\nabla_x L(x, \lambda, \mu) = 0$
$\mu \ge 0$
$\forall i,\ \mu_i \left(h(x)_i - t_i\right) = 0$
Lagrange multipliers
Problem 1: $\min_\beta \|y - X\beta\|_2^2$ subject to $\|\beta\|_2^2 \le t$
Problem 2: $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
TODO: show that solving these two problems is equivalent.
L2 regularization
$\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
Setting the derivative to zero gives the closed form:
$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$
Comments:
1. Introducing the regularization makes the regression numerically more robust
2. The regularization term controls the size of the solution
3. In practice, the regularization coefficient $\lambda$ is chosen by cross-validation
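A minimal numpy sketch of the ridge closed form; the helper name `ridge_fit` and the synthetic data are illustrative only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression closed form: beta = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
print(ridge_fit(X, y, lam=1.0))
```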
Probabilistic (Bayesian) interpretation
Suppose we have a Gaussian prior $\beta \sim \mathcal{N}(0, \lambda^{-1} I)$.
Since we assume that the error term is also a Gaussian random variable, the data likelihood can be written as
$L(\beta) \propto \exp\left(-\|y - X\beta\|_2^2\right)$
The posterior mean would be
$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$
Sparse regularization
$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_0$
where the L0 norm is the number of non-zeros in the vector
Combinatorial explosion: we would need to enumerate all possible subsets
Sparse regularization
$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_0$
where the L0 norm is the number of non-zeros in the vector
Combinatorial explosion: we would need to enumerate all possible subsets
Solution: relaxation to the L1 norm (LASSO)
$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
Comments:
1. It allows efficient feature selection
2. The objective function is non-smooth
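One standard way to handle the non-smooth L1 term is coordinate descent with soft-thresholding. The sketch below (hypothetical helper names `soft_threshold`, `lasso_cd`; no convergence checks or feature scaling) is one possible implementation, not necessarily the algorithm developed in the sub-gradient discussion that follows.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Coordinate descent for min_beta ||y - X beta||_2^2 + lam * ||beta||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)          # assumes no all-zero columns
    for _ in range(n_iters):
        for j in range(d):
            # partial residual with feature j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta

# toy usage: only the first three coefficients are truly non-zero
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=5.0), 2))
```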
TODO: sub-gradient and LASSO
TODO: Convex optimization
Convex Optimization
$\min_x f(x)$ subject to $g(x) = 0$, $h(x) \le t$
The above problem is convex if:
1. the objective function is convex
2. the feasible set defined by the constraints is convex
A local optimum is a global optimum.
Convex Optimization Solvers
Matrix factorization
Matrix factorization and low-rank approximation
Variants: nonnegative matrix factorization, sparse matrix factorization
Matrix Norms
Frobenius norm (the "L2 norm" of matrices):
$\|A\|_F = \left(\sum_{i,j} A_{i,j}^2\right)^{1/2} = \mathrm{trace}(A^T A)^{1/2}$
Nuclear norm (the "L1 norm" of matrices):
$\|A\|_* = \mathrm{trace}\left(\sqrt{A^T A}\right) = \sum_k \sigma_k$
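Both norms are available in numpy; the sketch below checks the singular-value characterizations on a random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
sing_vals = np.linalg.svd(A, compute_uv=False)

fro = np.linalg.norm(A, 'fro')   # sqrt of the sum of squared entries = sqrt(trace(A^T A))
nuc = np.linalg.norm(A, 'nuc')   # sum of the singular values
print(np.isclose(fro, np.sqrt((sing_vals ** 2).sum())))
print(np.isclose(nuc, sing_vals.sum()))
```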
Matrix factorization
Matrix factorization and low-rank approximation:
$X \in \mathbb{R}^{n \times m}$, $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times m}$
We now want to find the optimal $W$ and $H$:
$\min_{W,H} \|X - WH\|_F^2$
How to solve this optimization problem?
$\min_{W,H} \|X - WH\|_F^2$
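For the unconstrained rank-$r$ problem, the truncated SVD gives an optimal solution (Eckart-Young theorem). A minimal numpy sketch on synthetic data, with a hypothetical rank `r`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 30))
r = 5

# Keep the top-r singular values/vectors to form the best rank-r approximation.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = U[:, :r] * s[:r]      # n x r (left singular vectors scaled by singular values)
H = Vt[:r, :]             # r x m
print(np.linalg.norm(X - W @ H, 'fro'))
```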
Regularized Matrix Factorization
$\min_{W,H} \|X - WH\|_F^2$
We assume the ranks of $W$ and $H$ are no larger than $r$, i.e., $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times m}$:
$\min_{W,H} \|X - WH\|_F^2 + \lambda \left(\|W\|_F^2 + \|H\|_F^2\right)$
Regularized matrix approximation
We want to find a low-rank approximation with an implicit regularization:
$\min_Y \|X - Y\|_F^2 + \lambda \|Y\|_*$ (nuclear-norm, "L1"-type regularization)
We can also solve this explicit low-rank problem:
$\min_{W,H} \|X - WH\|_F^2 + \frac{\lambda}{2} \left(\|W\|_F^2 + \|H\|_F^2\right)$ ("L2" regularization + low rank)
Nonnegative Matrix Factorization
$X \in \mathbb{R}^{n \times m}$, $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times m}$
We now want to find the optimal $W$ and $H$:
$X \approx WH$ subject to $\forall k,j\ H_{k,j} \ge 0$ and $\forall i,k\ W_{i,k} \ge 0$
Nonnegative Matrix Factorization
Minimize the Frobenius norm: $\min_{W,H} \|X - WH\|_F^2$
Minimize the KL divergence: $\min_{W,H} D(X \| WH)$, where
$D(A \| B) = \sum_{i,j} \left( A_{i,j} \log \frac{A_{i,j}}{B_{i,j}} - A_{i,j} + B_{i,j} \right)$
Coordinate optimization
Optimize one variable while fixing the values of all other variables
[figure: decomposition of $X$, where $a$ and $r$ denote column and row vectors of $X$]
Coordinate optimization
Gradient calculation:
$g(h_{ij}) = h_{ij} \frac{(W^T X)_{ij}}{(W^T W H)_{ij}} - h_{ij}$
$g(w_{ij}) = w_{ij} \frac{(X H^T)_{ij}}{(W H H^T)_{ij}} - w_{ij}$
A gradient descent step
$w_{ij}^{new} = w_{ij} + g(w_{ij})$, $h_{ij}^{new} = h_{ij} + g(h_{ij})$
may violate the nonnegativity constraint.
Coordinate optimization
Projected gradient descent:
$w_{ij}^{new} = \max\left(0,\ w_{ij} + g(w_{ij})\right)$
$h_{ij}^{new} = \max\left(0,\ h_{ij} + g(h_{ij})\right)$
Multiplicative updates:
$w_{ij}^{new} \leftarrow w_{ij} \frac{(X H^T)_{ij}}{(W H H^T)_{ij}}$
$h_{ij}^{new} \leftarrow h_{ij} \frac{(W^T X)_{ij}}{(W^T W H)_{ij}}$
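A minimal numpy sketch of the multiplicative updates for the Frobenius objective. The function name, random initialization, fixed iteration count, and the small `eps` added to avoid division by zero are all illustrative choices.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iters=500, eps=1e-10):
    """Multiplicative updates for min_{W,H >= 0} ||X - WH||_F^2 (X must be nonnegative)."""
    n, m = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, r))
    H = rng.random((r, m))
    for _ in range(n_iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # h_ij <- h_ij (W^T X)_ij / (W^T W H)_ij
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # w_ij <- w_ij (X H^T)_ij / (W H H^T)_ij
    return W, H

X = np.abs(np.random.default_rng(1).normal(size=(40, 25)))
W, H = nmf_multiplicative(X, r=5)
print(np.linalg.norm(X - W @ H, 'fro'))
```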
TODO: derive the multiplicative update rules for KL divergence
Netflix Challenge