Applied Machine Learning Annalisa Marsico

1 Applied Machine Learning Annalisa Marsico OWL RNA Bioinformatics group Max Planck Institute for Molecular Genetics Free University of Berlin SoSe 2015

2 What is Machine Learning?

3 What is Machine Learning? The field of Machine Learning seeks to answer the question: how can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes? Arthur Samuel (1959): field of study that gives computers the ability to learn without being explicitly programmed. Example: playing checkers against Samuel, the computer eventually became much better than Samuel; this was the first solid refutation of the claim that computers cannot learn

4 What is Machine Learning? Tom Mitchell (1998): a computer learns from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with E

5 What is Machine Learning? ML sits at the intersection of Computer Science and Statistics. Computer Science asks: how can we build machines that solve problems, and which problems are tractable or intractable? Statistics asks: what can be inferred from the data plus some modeling assumptions, and with what reliability?

6 ML's applications Army, security: imaging (object/face detection and recognition, object tracking); mobility (robotics, action learning, automatic driving). Computers, internet: interfaces (brainwaves for the disabled, handwriting / speech recognition); security (spam / virus filtering, virus troubleshooting)

7 ML's applications Finance: banking (identify good, dissatisfied or prospective customers), optimize / minimize credit risk, market analysis. Gaming: intelligent agents (adaptability to the player), object tracking, 3D modeling, etc.

8 ML's applications Biomedicine, biometrics: medicine (screening, diagnosis and prognosis, drug discovery, etc.); security (face recognition, signature, fingerprint, iris verification, etc.). Bioinformatics: motif finders, gene detectors, interaction networks, gene expression predictors, cancer/disease classification, protein folding prediction, etc.

9 Examples of Learning problems Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on diet, blood tests, disease history... Identify the risk factors for colon cancer, based on gene expression and clinical measurements. Predict whether an email is spam or not based on the most commonly occurring words (email/spam -> classification problem). Predict the price of a stock in 6 months from now, based on company performance and economic data

10 You already use it! Some more examples from daily life... Based on past choices, which movies will interest this viewer? (Netflix) Based on past choices and metadata, which music will this user probably like? (Lastfm, Spotify) Based on past choices and profile features, should we match these people in an online dating service? (Tinder) Based on previous purchases, which shoes is the user likely to like? (Zalando) However, predictive models regularly generate wrong predictions: in 2010 an erroneous algorithm caused a financial crash...

11 Learning process Predictive modeling: process of developing a mathematical tool or model that generates accurate predictions

12 Prediction vs Interpretation It is always a trade-off. If the goal is high accuracy (e.g. a spam filter), then we do not care why and how the model reaches it. If the goal is interpretability (e.g. in biology, SNPs which predict a certain disease risk), then we care why and how

13 Key ingredients for a successful predictive model Deep knowledge of the context and the problem (if a signal is present in the data, you are going to find it). Choose your features carefully (e.g. collect relevant data). A versatile computational toolbox for model building, but also data pre-processing, visualization, statistics: Weka, Knime, R (check out the caret package). Critical evaluation

14 Supervised vs Unsupervised Learning Typical scenario: we have an outcome, quantitative (price of a stock, risk factor...) or categorical (heart attack yes or no), that we want to predict based on some features. We have a training set of data and build a prediction model, a learner, able to predict the outcome of new unseen objects. A good learner accurately predicts such an outcome. Supervised learning: the presence of the outcome variable guides the learning process. Unsupervised learning: we have only features, no outcome; the task is rather to describe the data

15 Unsupervised learning find a structure in the data Given $X = \{x_n\}$ measurements / observations / features, find a model M such that $p(M \mid X)$ is maximized, i.e. find the process that is most likely to have generated the data

16 Supervised learning Find the connection between two sets of observations: the input set and the output set. Given $\{x_n, y_n\}$, find a hypothesis f (function, classification boundary) such that $f(x_n) = y_n$ for all $n \in [1..N]$, where N is the number of observations. $X = \{x_n\}$ are also called predictors, independent variables or covariates; $Y = \{y_n\}$ is also called the response or dependent variable

17 Example 1: Colorectal Cancer There is a correlation between CSA (colon specific antigen) and a number of clinical measurements in 200 patients. Goal: predict CSA from clinical measurements. Supervised learning; regression problem (the outcome measure is quantitative)

18 Example 2: Gene expression microarrays Measure the expression of all genes in a cell simultaneously, by measuring the amount of RNA present in the cell for each gene. We do this for several experiments (samples). Goal: understand how genes and samples are organized. Which genes are predictive for certain samples? Unsupervised learning: p (# of samples) << N (# of genes). Supervised learning: yes, possible, with some tricks

19 Variable Types Y quantitative -> regression model. Y qualitative (categorical) -> classification model (two or more classes). Inputs X can also be quantitative or qualitative; there can be missing values; dummy variables are sometimes a convenient way to encode qualitative inputs (see the sketch below). Both problems can be viewed as a task in function approximation f(x)
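
As a small illustration of dummy variables, here is a minimal R sketch (R is the toolbox recommended earlier; the data frame and its values are made up for illustration) showing how a qualitative input is expanded into indicator columns:

```r
# A toy data frame with one qualitative predictor (hypothetical values)
df <- data.frame(y = c(1.2, 3.4, 2.2, 4.1),
                 color = factor(c("red", "blue", "red", "green")))

# model.matrix() expands the factor into dummy (0/1) variables,
# using the first level alphabetically ("blue") as the baseline
model.matrix(y ~ color, data = df)
```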

20 Let's re-formulate the training task Given X (features), make a good prediction of Y, denoted by Ŷ (i.e. identify an appropriate function f(x) to model Y). If Y takes values in R, then so should Ŷ (quantitative response). For a categorical output, the prediction Ĝ should take a class value, as G does (categorical response).

21 Supervised Linear Models

22 Linear Models and Least Squares Given a vector of inputs $X^T = (X_1, X_2, ..., X_p)$, where p = # of features and N = # of points, we want to predict the output Y via the model $$\hat{Y} = f(X) = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j \quad \text{or, in matrix notation,} \quad \hat{Y} = X^T\hat\beta$$ The unknown coefficients $\beta$ are the parameters of the model. For each point i, i = 1...N: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip}$. N.B. we have included $\beta_0$ in the coefficient vector

23 Linear Models and Least Squares We want to fit a linear model to a set of training data $\{(x_{i1}, ..., x_{ip}), y_i\}$. There might be several choices of $\beta$. How do we choose them?

24 Linear Models and Least Squares Least squares method: we pick the coefficients $\beta$ to minimize the residual sum of squares $$RSS(\beta) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 = \sum_{i=1}^N (y_i - x_i^T\beta)^2$$ The solution is easy to characterize if we write it in matrix notation: $RSS(\beta) = (Y - X\beta)^T(Y - X\beta)$. Differentiating with respect to $\beta$ and setting the derivative to zero gives $X^T(Y - X\beta) = 0$, hence $\hat\beta = (X^TX)^{-1}X^TY$. (Figure: least-squares fits with one feature and with two features.) What happens if p > N, i.e. $X^TX$ is singular?
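
A minimal R sketch of this closed-form solution, on simulated data (all names and values here are illustrative), checking the hand-computed $\hat\beta = (X^TX)^{-1}X^TY$ against R's built-in lm():

```r
set.seed(1)
N <- 100; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))   # design matrix; first column = intercept
beta_true <- c(2, 0.5, -1, 3)
y <- drop(X %*% beta_true + rnorm(N))

# Closed-form least squares: solve (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Compare with R's built-in fit (lm adds its own intercept)
fit <- lm(y ~ X[, -1])
cbind(manual = beta_hat, lm = coef(fit))    # the two estimates agree
```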

25 Another geometrical interpretation of linear regression Least-squares regression with two predictors: the outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors $x_1$ and $x_2$; the projection $\hat{y}$ represents the vector of the least squares predictions. We minimize $RSS(\beta) = \|y - X\beta\|^2$ by choosing $\beta$ so that the residual vector is orthogonal to this subspace.

26 Example: Quantitative Structure-Activity Relationship We want to study the relationship between chemical structure and activity (solubility). Screen several compounds against a target in a biological assay. Measure quantitative features $x_j$ (molecular weight, electrical charge, surface area, # of atoms...). The response y is the activity (inhibition, solubility...): $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip}$. Quantitative structure-activity relationship (QSAR) modeling. (Figure: aspirin as an example compound.)

27 Measuring Performance in Regression Models If the outcome is a number -> RMSE (a function of the model residuals): $$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2}$$ where $y_i$ is the real value and $\hat{y}_i$ the predicted value. Another measure is $R^2$ -> the proportion of information in the data which is explained by the model; more a measure of correlation
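
Both measures are easy to compute from the residuals; a minimal R sketch on simulated data (illustrative only):

```r
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
r_squared <- function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

rmse(y, fitted(fit))        # root mean squared error of the fit
r_squared(y, fitted(fit))   # matches summary(fit)$r.squared for least-squares fits
```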

28 A short de-tour of the Predictive Modeling Process Always do a scatter plot of the response vs each feature to see if a linear relationship exists! Introduce some non-linearity into the model, e.g. a quadratic term, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2$, or fit a local linear regression
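
A sketch of what this looks like in R, assuming simulated data with a genuinely quadratic relationship:

```r
set.seed(3)
x1 <- runif(100, -2, 2)
y  <- 1 + 0.5 * x1 + 2 * x1^2 + rnorm(100, sd = 0.5)

plot(x1, y)                       # the scatter plot reveals the curvature

fit_lin  <- lm(y ~ x1)            # a straight line misses the structure
fit_quad <- lm(y ~ x1 + I(x1^2))  # adding the squared term captures it
anova(fit_lin, fit_quad)          # formal comparison of the two fits
```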

29 A short de-tour of the Predictive Modeling Process How the predictors enter the model is very important: 1. data transformation (centering / scaling, skewed data, outliers); 2. feature engineering / feature extraction (what are actually the informative features?)

30 A short de-tour of the Predictive Modeling Process Data transformation is necessary to avoid biases. Centering: subtract the mean of the data; scaling: divide by the standard deviation: $z = \frac{x - \bar{x}}{s}$. Skewness: $$skewness = \frac{\sum_i (x_i - \bar{x})^3}{(n-1)\,v^{3/2}}, \qquad v = \frac{\sum_i (x_i - \bar{x})^2}{n-1}$$ A skewness value of 20 indicates high skewness; a log transformation helps reduce the skewness
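
A minimal R sketch of these transformations (the exponential data are just an example of a right-skewed variable):

```r
set.seed(4)
x <- rexp(200, rate = 0.5)   # right-skewed example data

# Centering and scaling: z = (x - mean) / sd
z <- scale(x)                # scale() centers and scales by default

# Sample skewness as defined above
skewness <- function(x) {
  n <- length(x)
  v <- sum((x - mean(x))^2) / (n - 1)
  sum((x - mean(x))^3) / ((n - 1) * v^(3/2))
}
skewness(x)                  # clearly positive: right-skewed
skewness(log(x))             # the log transform reduces the skewness
```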

31 Between-Predictor Correlations Predictors can be correlated. If the correlation among predictors is high, then the ordinary least squares solution for linear regression will have high variability and will be unstable -> poor interpretation. (Figure: correlation heatmap for the structure-solubility data.) Collinearity: high correlation between pairs of variables
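
A quick way to inspect between-predictor correlations in R, on simulated collinear data (illustrative):

```r
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is almost collinear with x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

round(cor(X), 2)                # pairwise correlations; |r| near 1 flags collinearity
heatmap(cor(X), symm = TRUE)    # simple correlation heatmap with base R
```

The caret package mentioned earlier also provides findCorrelation(), which suggests which of a set of highly correlated predictors to drop.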

32 Data reduction and feature extraction We want a smaller set of predictors which captures most of the information in the data -> maybe predictors which are combinations of the original predictors? Principal Component Analysis (PCA) is a commonly used data reduction technique

33 A short de-tour of the Predictive Modeling Process Data reduction and feature extraction What about removing correlated predictors? Yes, possible, but there are cases where a predictor is correlated with a linear combination of other predictors... not detectable with pairwise correlation analysis. Other reasons to remove predictors: 1. zero-variance predictors (variables with few unique values); 2. the frequency of unique values is severely disproportionate

34 Goal: We want a technique (regression) which takes into account (solves) correlated variables... Regression + feature reduction

35 Principal Component Analysis (PCA) Idea: given data points (predictors) in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible, e.g. find the best planar approximation to 3D data. Learns a lower-dimensional representation of the inputs and uncovers structure in the data. It generates a smaller set of predictors which captures the majority of the information in the original variables; the new predictors are functions of the original predictors

36 Example 1: study the motion of a spring The important dimension to describe the dynamics of the system is x, but we do not know that! Every time sample recorded by the cameras is a point (vector) in a D-dimensional space, D=6. From linear algebra: every vector in a D-dimensional space can be written as a linear combination of some basis. Is there another basis (a linear combination of the original basis) which better re-expresses the data?

37 Principal Component Analysis (PCA) The hope is that the new basis will filter out the noise and reveal the hidden structure of the data -> in our case it will determine x as the important direction... You may have noticed the use of the word linear: PCA makes the stringent but powerful assumption of linearity -> it restricts the set of potential bases

38 PCA formal definition PCA: orthogonal projection of the data into a lower dimensional space, such that the variance of the projected data is maximal

39 Variance and the goal Quantitatively, we assume that the directions with the largest variances in our data space contain the dynamics of interest, and hence the highest signal-to-noise ratio (SNR)

40 Principal Component analysis Geometrical interpretation: find the rotation of the basis (axes) such that the first axis lies in the direction of greatest variation. In the new system the predictors (PCs) are orthogonal

41 PCA - Redundancy When two predictors x1 and x2 are correlated (measure redundant information), this complicates disentangling the effect of x1 and x2 on the response. It seems that either one predictor or a linear combination of the predictors can be used here

42 PCA in words Find the linear combination of X (in the new basis) which has the maximum variation. How do we formally find these new directions (basis) $u_i$? Project the data on the new directions: $X^Tu$. Find $u_1$ such that $var(X^Tu_1)$ is maximized, subject to the condition $u_1^Tu_1 = 1$. Find $u_2$ such that $var(X^Tu_2)$ is maximized, subject to the conditions $u_2^Tu_2 = 1$ and $u_1^Tu_2 = 0$. Keep finding directions of greatest variation orthogonal to those already found. Ideally, if N is the dimensionality of the original data, we need only a few D < N directions to explain the variability in the data sufficiently well (see the sketch below)
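
These directions are the eigenvectors of the covariance matrix of the (centered) data; a minimal R sketch on simulated 2D data, checked against R's built-in prcomp():

```r
set.seed(6)
X  <- matrix(rnorm(200), 100, 2) %*% matrix(c(2, 1, 0, 0.3), 2, 2)  # correlated data
Xc <- scale(X, center = TRUE, scale = FALSE)                        # center first

# Directions of maximal variance = eigenvectors of the covariance matrix
eig <- eigen(cov(Xc))
eig$vectors    # u_1, u_2: orthonormal directions, sorted by variance
eig$values     # variance of the data along each direction

# Same result with R's built-in PCA (up to the sign of each direction)
pca <- prcomp(X)
pca$rotation
pca$sdev^2     # variances along the PCs = the eigenvalues above
```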

43 How many Principal Components? Use the eigenvalues, which represent the variance explained by each component. Choose the number of eigenvalues that amounts to the desired percentage of the variance (scree plot; see the sketch below)
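
A sketch of this choice in R, using the built-in USArrests data as an example:

```r
pca <- prcomp(USArrests, scale. = TRUE)   # re-scale before PCA (see practical hints)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained                             # proportion of variance per component
cumsum(var_explained)                     # keep enough PCs for, say, 90% of the variance

# Scree plot: variance explained vs component number
plot(var_explained, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained")
```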

44 PCA example: image compression

45 Principal Component Analysis (PCA) PCs are surrogate features / variables, and therefore (linear) functions of the original variables, which better re-express the data. We can express the PCs as linear combinations of the original predictors; the first PC is the best linear combination, the one capturing most of the variance: $$PC_j = a_{j1}\,feature_1 + a_{j2}\,feature_2 + ... + a_{jp}\,feature_p$$ where p = # of predictors and $a_{j1}, a_{j2}, ..., a_{jp}$ are the component weights / loadings
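
In R the loadings $a_{j1}, ..., a_{jp}$ sit in the rotation matrix returned by prcomp(); a small sketch verifying that the PC scores really are this linear combination (using USArrests again):

```r
pca <- prcomp(USArrests, scale. = TRUE)
a1 <- pca$rotation[, 1]          # loadings a_11 ... a_1p of the first PC
a1

# The PC1 scores equal the loading-weighted combination of the scaled predictors
Z <- scale(USArrests)
max(abs(Z %*% a1 - pca$x[, 1]))  # ~ 0, up to floating-point error
```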

46 Summarizing... The cool thing is that we have created components (PCs) which are uncorrelated. Some predictive models prefer predictors which are uncorrelated in order to find a good solution; PCA creates new predictors with exactly these characteristics! To get an intuition of the data: if PCA captured most of the information in the data, then plotting e.g. PC1 vs PC2 can reveal clusters/structures in the data

47 PCA practical hints 1. PCA seeks directions of maximum variance, so it is sensitive to the scale of the data; it might give higher weights to variables on large scales. Good practice is to re-scale the data before doing PCA. 2. Skewness can also cause problems

48 Goal: We want a technique (regression) which takes into account (solves) correlated variables... Regression + feature reduction. But PCA is an unsupervised technique... so it is blind to the response

49 Principal Component Regression (PCR) Dimension reduction method: it works in two steps. 1. Find transformed predictors $Z_1, Z_2, ..., Z_M$ with M < p (# of original features): $Z_m = \sum_{j=1}^p a_{jm} X_j$. 2. Fit a least squares model to these new predictors: $y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i$. The choice of $Z_1 ... Z_M$ and the selection of the $a_{jm}$ can be achieved in different ways; one way is Principal Component Regression (PCR), almost PLS... E.g. $Z_1 = a_{11}x_1 + a_{21}x_2$ is the first principal component in the case of two variables; the $a_{jm}$ are the scores or loadings (see the sketch below)
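
In practice both steps are automated; a minimal sketch with the pls package in R (assuming it is installed), using the gasoline data shipped with that package:

```r
library(pls)  # install.packages("pls") if needed

data(gasoline)   # NIR spectra (predictors) and octane number (response)
fit_pcr <- pcr(octane ~ NIR, data = gasoline, ncomp = 10,
               scale = TRUE, validation = "CV")

summary(fit_pcr)                             # CV error as a function of # components
validationplot(fit_pcr, val.type = "RMSEP")  # pick M where the CV error flattens
```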

50 Drawback of PCR We assume that the directions in which the $x_i$ show the most variation are the directions associated with the response y... If this assumption holds, then an appropriate choice of M = # of components will give better results. But this assumption is not always fulfilled, and when $Z_1 ... Z_M$ are produced in an unsupervised way there is no guarantee that these directions (which best explain the input) are also the best to explain the output. When will PCR perform worse than ordinary least squares regression?

51 Partial Least Squares Regression (PLSR) Supervised alternative to PCR. It makes use of the response Y to identify the new features: it attempts to find directions that help explain both the response and the predictors

52 PLS Algorithm 1. Compute the first partial least squares direction $Z_1 = \sum_{j=1}^p a_{j1} X_j$ by setting each $a_{j1}$ to the coefficient from the simple linear regression of Y onto $X_j$ (in general, $Z_m = \sum_{j=1}^p a_{jm} X_j$). 2. Different interpretation of the loadings $a_{jm}$: here, they measure how important the predictor is for the response! 3. Then Y is regressed on $Z_1$, giving $\theta_1$. 4. To find $Z_2$ we adjust all variables for $Z_1$: we project (regress) each of them onto $Z_1$, $\hat{X}_j = \gamma_j Z_1$. 5. Compute the residuals $X_j - \gamma_j Z_1$ (the remaining information which has not been explained by the first PLS direction). 6. Compute $Z_2$ (and each further $Z_m$) in the same way, using the residualized data. 7. The iterative approach can be repeated M times to identify multiple PLS components (see the sketch below)
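
A sketch of step 1 in R on simulated data (illustrative; a1 plays the role of the $a_{j1}$ above), together with a full fit from the pls package for comparison:

```r
library(pls)

set.seed(7)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(3, 0, 0, 1, 0)) + rnorm(n)

# Step 1: a_j1 = coefficient of the simple regression of y onto X_j
a1 <- apply(X, 2, function(xj) coef(lm(y ~ xj))[2])
z1 <- X %*% a1                       # first PLS direction Z_1
theta1 <- coef(lm(y ~ z1))[2]        # step 3: regress Y on Z_1

# Full PLS fit for comparison
fit_pls <- plsr(y ~ X, ncomp = 3, validation = "CV")
summary(fit_pls)
```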

53 Example from the QSAR modeling problem - PCR (Figure: scatter plot of two predictors with the direction of the first PC.) The first PC direction contains no predictive information about the response

54 Example from the QSAR modeling problem - PLS (Figure: the PLS direction on two predictors.) The PLS direction contains highly predictive information about the response

55 Example from the QSAR modeling problem PCR & PLS (Figure: comparison of PLS and PCR.)

56 Summary Dimension reduction (PCA). Regression problem: linear regression (least squares). PCR and PLS are methods for feature reduction and de-correlation of the features: they reduce over-fitting and improve accuracy, but can be hard to interpret
