CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu


1 CS598 Machine Learning in Computational Biology (Lecture 5: Matrix - part 2) Professor Jian Peng Teaching Assistant: Rongda Zhu


2 Feature engineering is hard
1. Extract informative features from domain knowledge
2. Suitable feature representation: discrete or continuous, log-scale or linear scale
3. Interactions between features: pairwise constraints, high-order complex patterns, combinatorial explosion
How do we know which features are useful? Features are highly correlated. How to combine features?

3 Deep learning is feature learning Key Idea: 1. Learning complex and meaningful features from raw data 2. Using layer-wise structure to extract features. Higher level features are derived from lower level features. 3. End-to-end learning

4 Neural Logistic regression
$P(y \mid x) = \frac{\exp(\theta_y^T h)}{\sum_{c \in \text{class}} \exp(\theta_c^T h)}$, where $h^{(i)} = \mathrm{sigmoid}(w_i^T \phi(x))$

5 Conditional Neural Fields y(i-1) y(i) y(i+1)

6 Matrix Data Matrix data Gene expression Dimensionality reduction and feature selection Low-rank approximation

7 High-dimensional data
[Figure: gene-by-individual data matrix; p = number of genes, n = number of individuals]

8 Overfitting
[Figure: data matrix with a disease label for each individual; p = number of parameters, n = number of data points]


9 Component analysis
How to understand the main signals from the data? Key assumptions:
1. High-dimensional data lie on a lower-dimensional space (a.k.a. the low-rank assumption)
2. Projections onto the lower-dimensional space describe the major properties of the data

10 Matrix decomposition Matrix factorization and low-rank approximation: Singular value decomposition (SVD): Other generalizations: Nonnegative matrix factorization Sparse matrix factorization
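
The low-rank approximation by SVD mentioned on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `truncated_svd`, the synthetic rank-3 matrix, and the noise level are made up for the example and do not come from the lecture.

```python
import numpy as np

def truncated_svd(X, r):
    """Return a rank-r approximation of X built from its top r singular triplets."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
    return X_r, s

rng = np.random.default_rng(0)
# Synthetic matrix (illustrative only): rank-3 signal plus small noise.
signal = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 20))
X = signal + 0.01 * rng.normal(size=(50, 20))

X3, s = truncated_svd(X, 3)
# Relative Frobenius error of the rank-3 approximation.
rel_err = np.linalg.norm(X - X3) / np.linalg.norm(X)
```

The residual of the truncated SVD equals the energy in the discarded singular values, which is why a matrix that is close to low rank is approximated well by a few components.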

11 Component Analysis: Tissue-specific gene expression Blood Brain Liver

12 Deconvolution: Gene expression profile
Expression (known) = Signature (known) × Mixture (unknown)

13 Deconvolution: Gene expression profile
Expression (known) = Signature (unknown) × Mixture (known)

14 Deconvolution: Gene expression profile
Expression (known) = Signature (unknown) × Mixture (unknown)

15 Deconvolution: Algorithms
Expression (known) = Signature (known) × Mixture (unknown): Linear regression
Expression (known) = Signature (unknown) × Mixture (known): Linear regression
Expression (known) = Signature (unknown) × Mixture (unknown): Matrix factorization
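
As a minimal sketch of the first case (signature known, mixture unknown), the mixture can be estimated by ordinary least squares, one regression per sample. The matrix shapes, the names `S`, `M_true`, `E`, and the simulated data are assumptions made for illustration. Plain least squares does not enforce nonnegativity or sum-to-one proportions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_types, n_samples = 200, 3, 5

# Hypothetical signature matrix: expression of each gene in each pure cell type.
S = rng.uniform(0.0, 10.0, size=(n_genes, n_types))
# Hypothetical ground-truth mixing proportions; each column sums to one.
M_true = rng.dirichlet(np.ones(n_types), size=n_samples).T
# Observed expression = signature x mixture, plus small measurement noise.
E = S @ M_true + 0.01 * rng.normal(size=(n_genes, n_samples))

# Solve min_M ||E - S M||_F^2; lstsq fits every sample column at once.
M_hat, *_ = np.linalg.lstsq(S, E, rcond=None)
```

The second case (mixture known, signature unknown) is the same regression applied to the transposed problem.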

16 Notations
number of observations: $n$
number of dimensions: $d$
data matrix: $X \in \mathbb{R}^{n \times d}$
i-th data point: $x_i = (X_{i,1}, X_{i,2}, \ldots, X_{i,d}) \in \mathbb{R}^{1 \times d}$
response vector: $y \in \mathbb{R}^n$

17 Linear Regression
[Figure: data matrix with a disease label for each individual]
$y_i = x_i^T \beta + \epsilon_i = \sum_j \beta_j x_{i,j} + \epsilon_i$
Fitting error: $\epsilon_i$

18 Linear Regression
[Figure: data matrix with a disease label for each individual]
Assumption: the errors are Gaussian noise
$y = X\beta + \epsilon$
$\hat\beta = \arg\min_\beta \sum_i (y_i - \sum_j \beta_j x_{i,j})^2$

19 Linear Regression

20 Linear Regression
$\hat\beta = \arg\min_\beta \sum_i (y_i - \sum_j \beta_j x_{i,j})^2 = \arg\min_\beta (y - X\beta)^T (y - X\beta) = (X^T X)^{-1} X^T y$
Question: How to derive the closed-form solution?
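
One possible answer to the question above, assuming $X^T X$ is invertible: expand the squared error, set its gradient with respect to $\beta$ to zero, and solve the resulting normal equations.

```latex
\begin{aligned}
f(\beta) &= (y - X\beta)^T (y - X\beta)
          = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta, \\
\nabla_\beta f(\beta) &= -2 X^T y + 2 X^T X \beta = 0
  \;\Longrightarrow\; X^T X \hat\beta = X^T y, \\
\hat\beta &= (X^T X)^{-1} X^T y .
\end{aligned}
```

Because $f$ is convex in $\beta$, this stationary point is a global minimizer.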

21 Underfitting and Overfitting
Underfitting: the model is too simple. Overfitting: the model is too complex.
Control the complexity or degrees of freedom of the model:
1. Add constraints to the model
2. Add regularization

22 Linear regression with constraints simplex constraint

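23 Linear constraints
Equality constraints: $Q\beta = b$, where $Q \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$
Example: sum-to-one constraint: $\sum_j \beta_j = 1$
Inequality constraints: $R\beta \ge c$, where $R \in \mathbb{R}^{l \times d}$, $c \in \mathbb{R}^l$
Example: nonnegative constraint: $\forall j,\ \beta_j \ge 0$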

24 Norm constraints
Norm: a function that assigns a strictly positive length or size to each non-zero vector in a vector space
L1 norm: $\|\beta\|_1 = \sum_j |\beta_j|$
L2 norm: $\|\beta\|_2 = (\sum_j \beta_j^2)^{1/2}$
Lp norm: $\|\beta\|_p = (\sum_j |\beta_j|^p)^{1/p}$

25 Regularization: norm constraints
$\sum_j |\beta_j| \le t$
$\sum_j \beta_j^2 \le t^2$

26 Lagrange multiplier
$\min_x f(x)$ subject to $g(x) = 0$, $h(x) \le t$
Remove constraints: $\min_{x,\lambda,\mu} L(x,\lambda,\mu) = f(x) + \lambda^T g(x) + \mu^T (h(x) - t)$
KKT conditions: $\nabla_x L(x,\lambda,\mu) = 0$, $\mu \ge 0$, $\forall i,\ \mu_i (h(x)_i - t_i) = 0$

27 Lagrange multiplier
Problem 1: $\min_\beta \|y - X\beta\|_2^2$ subject to $\|\beta\|_2^2 \le t$
Problem 2: $\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
TODO: Solving these two problems is equivalent.

28 L2 regularization
$\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
Setting the derivative of the objective to zero gives: $\hat\beta = (X^T X + \lambda I)^{-1} X^T y$
Comments:
1. Introducing the regularization makes the regression numerically more robust
2. The regularization term controls the size of the solution
3. In practice, the regularization coefficient is chosen by cross-validation
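
The closed-form L2-regularized solution can be checked numerically. The sketch below assumes synthetic data with more dimensions than observations, a setting where the unregularized normal equations are singular. The function name `ridge` and all data are illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form L2-regularized estimate (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 50))            # more dimensions than observations
beta_true = rng.normal(size=50)
y = X @ beta_true + 0.1 * rng.normal(size=30)

beta_small = ridge(X, y, lam=0.1)        # weak regularization
beta_large = ridge(X, y, lam=100.0)      # strong regularization
```

A larger coefficient shrinks the solution toward zero, which is the "controls the size of the solution" comment on the slide.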

29 Probabilistic (Bayesian) interpretation
Suppose we have a Gaussian prior $\beta \sim N(0, \lambda^{-1} I)$.
Since we assume that the error term is also a Gaussian random variable, the data likelihood can be written as $L(\beta) \propto \exp(-\frac{1}{2}\|y - X\beta\|_2^2)$
The posterior mean is then $\hat\beta = (X^T X + \lambda I)^{-1} X^T y$

30 Sparse regularization
$\hat\beta = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_0$
where the L0 norm is the number of non-zeros in the vector
Combinatorial explosion: we need to enumerate all possible subsets

31 Sparse regularization
$\hat\beta = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_0$
where the L0 norm is the number of non-zeros in the vector
Combinatorial explosion: we need to enumerate all possible subsets
Solution: Relaxation to L1 norm
$\hat\beta = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
Comments:
1. It allows efficient feature selection
2. Non-smooth objective function
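
One common way to handle the non-smooth L1 objective is proximal gradient descent (ISTA) with soft-thresholding. This is a swapped-in technique rather than the sub-gradient method listed on the next slide, and the function names, step size choice, and synthetic data below are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize ||y - X b||_2^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the smooth part's gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ b - y)     # gradient of the squared error
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[[0, 5, 10]] = [3.0, -2.0, 1.5]   # only three useful features
y = X @ beta_true + 0.1 * rng.normal(size=100)

beta_hat = lasso_ista(X, y, lam=20.0)
support = np.flatnonzero(beta_hat)         # indices of the selected features
```

Soft-thresholding sets small coefficients exactly to zero, which is how the L1 penalty performs feature selection.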

32 TODO: sub-gradient and LASSO

33 TODO: Convex optimization

34 Convex Optimization
$\min_x f(x)$ subject to $g(x) = 0$, $h(x) \le t$
The above problem is convex if:
1. the objective function is convex
2. the feasible set defined by the constraints is convex
A local optimum is then a global optimum

35 Convex Optimization Solvers

36 Matrix factorization Matrix factorization and low-rank approximation: Variants: Nonnegative matrix factorization Sparse matrix factorization

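37 Matrix Norm
Frobenius norm (L2 norm of matrices): $\|A\|_F = (\sum_{i,j} A_{i,j}^2)^{1/2} = \mathrm{trace}(A^T A)^{1/2} = (\sum_k \sigma_k^2)^{1/2}$
Nuclear norm (L1 norm of matrices): $\|A\|_* = \mathrm{trace}(\sqrt{A^T A}) = \sum_k \sigma_k$
where $\sigma_k$ are the singular values of $A$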
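
The equivalent expressions for the two matrix norms can be verified numerically. The sketch below uses an arbitrary random matrix, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 4))
sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A

# Frobenius norm computed three ways.
fro_entries = np.sqrt(np.sum(A ** 2))        # (sum_ij A_ij^2)^(1/2)
fro_trace = np.sqrt(np.trace(A.T @ A))       # trace(A^T A)^(1/2)
fro_sigma = np.sqrt(np.sum(sigma ** 2))      # L2 norm of the singular values

# Nuclear norm computed two ways.
nuc_sigma = np.sum(sigma)                    # L1 norm of the singular values
nuc_numpy = np.linalg.norm(A, "nuc")         # NumPy's built-in nuclear norm
```

Viewing both norms as vector norms of the singular values is what motivates calling them the L2 and L1 norms of matrices.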

38 Matrix factorization
Matrix factorization and low-rank approximation:
$X \in \mathbb{R}^{n \times m}$, $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times m}$
We now want to find optimal W and H: $\min_{W,H} \|X - WH\|_F^2$

39 How to solve this optimization problem?
$\min_{W,H} \|X - WH\|_F^2$

40 Regularized Matrix Factorization
$\min_{W,H} \|X - WH\|_F^2$
We assume the ranks of W and H are no larger than r:
$\min_{W,H} \|X - WH\|_F^2 + \lambda(\|W\|_F^2 + \|H\|_F^2)$, with $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times m}$
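
One standard way to attack the regularized factorization problem is alternating least squares: with one factor fixed, the objective is an L2-regularized regression in the other factor and has a closed-form solution. The lecture does not state which solver it uses, so this is only a sketch. The function name `als_factorize`, the default parameters, and the synthetic data are assumptions.

```python
import numpy as np

def als_factorize(X, r, lam=0.1, n_iter=50, seed=0):
    """Alternating minimization of ||X - WH||_F^2 + lam (||W||_F^2 + ||H||_F^2)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.normal(size=(n, r))
    H = rng.normal(size=(r, m))
    I = np.eye(r)
    losses = []
    for _ in range(n_iter):
        # With W fixed, each column of H is a regularized regression of X[:, j] on W.
        H = np.linalg.solve(W.T @ W + lam * I, W.T @ X)
        # With H fixed, each row of W is a regularized regression of X[i, :] on H^T.
        W = np.linalg.solve(H @ H.T + lam * I, H @ X.T).T
        fit = np.linalg.norm(X - W @ H) ** 2
        penalty = lam * (np.linalg.norm(W) ** 2 + np.linalg.norm(H) ** 2)
        losses.append(fit + penalty)
    return W, H, losses

rng = np.random.default_rng(5)
# Synthetic rank-4 matrix with small noise (illustrative only).
X = rng.normal(size=(40, 4)) @ rng.normal(size=(4, 30)) + 0.01 * rng.normal(size=(40, 30))
W, H, losses = als_factorize(X, r=4)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Each half-step solves its subproblem exactly, so the recorded objective never increases, although the joint problem is not convex in (W, H).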

41 Regularized matrix approximation
We want to find a low-rank approximation with an implicit regularization:
$\min_Y \|X - Y\|_F^2 + \lambda \|Y\|_*$ (L1-regularization)
We can also solve this explicit low-rank problem:
$\min_{W,H} \|X - WH\|_F^2 + \frac{\lambda}{2}(\|W\|_F^2 + \|H\|_F^2)$ (L2-regularization + low-rank)

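42 Nonnegative Matrix Factorization
$X \in \mathbb{R}^{n \times m}$, $W \in \mathbb{R}^{n \times r}$, $H \in \mathbb{R}^{r \times m}$
We now want to find optimal W and H: $X \approx WH$
subject to $\forall k, j:\ H_{k,j} \ge 0$ and $\forall i, k:\ W_{i,k} \ge 0$

43 Nonnegative Matrix Factorization
Minimize the Frobenius norm: $\min_{W,H} \|X - WH\|_F^2$
Minimize the KL divergence: $\min_{W,H} D(X \| WH)$
where $D(A \| B) = \sum_{i,j} (A_{i,j} \log \frac{A_{i,j}}{B_{i,j}} - A_{i,j} + B_{i,j})$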

44 Coordinate optimization X Optimize one variable while fixing values of all other variables where a and r are column and row vectors of X

45 Coordinate optimization Gradient calculation: g(h ij )=h ij h ij (W T X) ij (W T WH) ij g(w ij )=w ij w ij (XH T ) ij (WHH T ) ij gradient descent step may violate the nonnegative constraint. w new ij = w ij + g(w ij ) h new ij = h ij + g(h ij ) g w +

46 Coordinate optimization
Projected gradient descent:
$w_{ij}^{new} = \max(0, w_{ij} + g(w_{ij}))$
$h_{ij}^{new} = \max(0, h_{ij} + g(h_{ij}))$
Multiplicative updates:
$w_{ij}^{new} \leftarrow w_{ij} \frac{(XH^T)_{ij}}{(WHH^T)_{ij}}$
$h_{ij}^{new} \leftarrow h_{ij} \frac{(W^T X)_{ij}}{(W^T W H)_{ij}}$
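
The multiplicative updates for the Frobenius objective can be sketched as follows. The small constant `eps` in the denominators is an added safeguard against division by zero and is not part of the slide. The function name, initialization, and synthetic data are likewise assumptions.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for min ||X - WH||_F^2 subject to W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive initialization keeps every later iterate nonnegative.
    W = rng.uniform(0.1, 1.0, size=(n, r))
    H = rng.uniform(0.1, 1.0, size=(r, m))
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # h_ij <- h_ij (W^T X)_ij / (W^T W H)_ij
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # w_ij <- w_ij (X H^T)_ij / (W H H^T)_ij
        errors.append(np.linalg.norm(X - W @ H) ** 2)
    return W, H, errors

rng = np.random.default_rng(6)
# Synthetic nonnegative matrix of rank 3 (illustrative only).
X = rng.uniform(size=(30, 3)) @ rng.uniform(size=(3, 20))
W, H, errors = nmf_multiplicative(X, r=3)
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, no projection step is needed to stay feasible.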

47 TODO: derive the multiplicative update rules for KL divergence

48 Netflix Challenge
