Applied Machine Learning Annalisa Marsico

1 Applied Machine Learning Annalisa Marsico OWL RNA Bioinformatics group Max Planck Institute for Molecular Genetics Free University of Berlin SoSe 2015

2 What is Machine Learning?

3 What is Machine Learning? The field of Machine Learning seeks to answer the question: how can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes? Arthur Samuel (1959): field of study that gives computers the ability to learn without being explicitly programmed. Example: playing checkers against Samuel, the computer eventually became much better than Samuel; this was the first solid refutation of the claim that computers cannot learn

4 What is Machine Learning? Tom Mitchell (1998): a computer learns from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with E

5 What is Machine Learning? ML sits at the intersection of Computer Science and Statistics. Computer Science asks: how can we build machines that solve problems, and which problems are tractable or intractable? Statistics asks: what can be inferred from the data plus some modeling assumptions, and with what reliability?

6 ML's applications Army, security: imaging (object/face detection and recognition, object tracking); mobility (robotics, action learning, automatic driving). Computers, internet: interfaces (brainwaves for the disabled, handwriting / speech recognition); security (spam / virus filtering, virus troubleshooting)

7 ML's applications Finance: banking (identify good, dissatisfied or prospective customers), optimize / minimize credit risk, market analysis. Gaming: intelligent agents (adaptability to the player), object tracking, 3D modeling, etc.

8 ML's applications Biomedicine, biometrics: medicine (screening, diagnosis and prognosis, drug discovery, etc.); security (face recognition, signature, fingerprint, iris verification, etc.). Bioinformatics: motif finders, gene detectors, interaction networks, gene expression predictors, cancer/disease classification, protein folding prediction, etc.

9 Examples of Learning problems Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack, based on diet, blood tests, disease history... Identify the risk factors for colon cancer, based on gene expression and clinical measurements. Predict whether an email is spam or not based on the most commonly occurring words (email/spam -> classification problem). Predict the price of a stock in 6 months from now, based on company performance and economic data

10 You already use it! Some more examples from daily life... Based on past choices, which movies will interest this viewer? (Netflix) Based on past choices and metadata, which music will this user probably like? (Lastfm, Spotify) Based on past choices and profile features, should we match these people in an online dating service? (Tinder) Based on previous purchases, which shoes is the user likely to like? (Zalando) However, predictive models regularly generate wrong predictions: in 2010 an erroneous algorithm caused a financial crash...

11 Learning process Predictive modeling: process of developing a mathematical tool or model that generates accurate predictions

12 Prediction vs Interpretation It is always a trade-off. If the goal is high accuracy (e.g. a spam filter), then we do not care why and how the model reaches it. If the goal is interpretability (e.g. in biology, SNPs which predict a certain disease risk), then we care why and how

13 Key ingredients for a successful predictive model Deep knowledge of the context and the problem (if a signal is present in the data, you are going to find it). Choose your features carefully (e.g. collect relevant data). A versatile computational toolbox for model building, but also data pre-processing, visualization, statistics: Weka, Knime, R (check out the caret package). Critical evaluation

14 Supervised vs Unsupervised Learning Typical scenario: we have an outcome, quantitative (price of a stock, risk factor...) or categorical (heart attack yes or no), that we want to predict based on some features. We have a training set of data and build a prediction model, a learner, able to predict the outcome of new unseen objects. A good learner accurately predicts such an outcome. Supervised learning: the presence of the outcome variable guides the learning process. Unsupervised learning: we have only features, no outcome; the task is rather to describe the data

15 Unsupervised learning find a structure in the data Given $X = \{x_n\}$ measurements / observations / features, find a model M such that $p(M \mid X)$ is maximized, i.e. find the process that is most likely to have generated the data

16 Supervised learning Find the connection between two sets of observations: the input set and the output set. Given $\{x_n, y_n\}$, find a hypothesis f (function, classification boundary) such that $f(x_n) = y_n$ for all $n \in [1..N]$, where N is the number of observations. $X = \{x_n\}$ are also called predictors, independent variables or covariates; $Y = \{y_n\}$ is also called the response or dependent variable

17 Example 1: Colorectal Cancer There is a correlation between CSA (colon specific antigen) and a number of clinical measurements in 200 patients. Goal: predict CSA from clinical measurements. Supervised learning; regression problem (the outcome measure is quantitative)

18 Example 2: Gene expression microarrays Measure the expression of all genes in a cell simultaneously, by measuring the amount of RNA present in the cell for each gene. We do this for several experiments (samples). Goal: understand how genes and samples are organized. Which genes are predictive for certain samples? Unsupervised learning: p (# of samples) << N (# of genes). Supervised learning: yes, possible, with some tricks

19 Variable Types Y quantitative -> regression model. Y qualitative (categorical) -> classification model (two or more classes). Inputs X can also be quantitative or qualitative; there can be missing values; dummy variables are sometimes a convenient way to encode qualitative inputs (see the sketch below). Both problems can be viewed as a task in function approximation f(x)
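
As a small illustration of dummy variables, here is a minimal R sketch (R is the toolbox recommended earlier; the data frame and its values are made up for illustration) showing how a qualitative input is expanded into indicator columns:

```r
# A toy data frame with one qualitative predictor (hypothetical values)
df <- data.frame(y = c(1.2, 3.4, 2.2, 4.1),
                 color = factor(c("red", "blue", "red", "green")))

# model.matrix() expands the factor into dummy (0/1) variables,
# using the first level alphabetically ("blue") as the baseline
model.matrix(y ~ color, data = df)
```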

20 Let's re-formulate the training task Given X (features), make a good prediction of Y, denoted by Ŷ (i.e. identify an appropriate function f(x) to model Y). If Y takes values in R, then so should Ŷ (quantitative response). For a categorical output, the prediction Ĝ should take a class value, as G does (categorical response).

21 Supervised Linear Models

22 Linear Models and Least Squares Given a vector of inputs $X^T = (X_1, X_2, ..., X_p)$, where p = # of features and N = # of points, we want to predict the output Y via the model $$\hat{Y} = f(X) = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j \quad \text{or, in matrix notation,} \quad \hat{Y} = X^T\hat\beta$$ The unknown coefficients $\beta$ are the parameters of the model. For each point i, i = 1...N: $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip}$. N.B. we have included $\beta_0$ in the coefficient vector

23 Linear Models and Least Squares We want to fit a linear model to a set of training data $\{(x_{i1}, ..., x_{ip}), y_i\}$. There might be several choices of $\beta$. How do we choose them?

24 Linear Models and Least Squares Least squares method: we pick the coefficients $\beta$ to minimize the residual sum of squares $$RSS(\beta) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 = \sum_{i=1}^N (y_i - x_i^T\beta)^2$$ The solution is easy to characterize if we write it in matrix notation: $RSS(\beta) = (Y - X\beta)^T(Y - X\beta)$. Differentiating with respect to $\beta$ and setting the derivative to zero gives $X^T(Y - X\beta) = 0$, hence $\hat\beta = (X^TX)^{-1}X^TY$. (Figure: least-squares fits with one feature and with two features.) What happens if p > N, i.e. $X^TX$ is singular?
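
A minimal R sketch of this closed-form solution, on simulated data (all names and values here are illustrative), checking the hand-computed $\hat\beta = (X^TX)^{-1}X^TY$ against R's built-in lm():

```r
set.seed(1)
N <- 100; p <- 3
X <- cbind(1, matrix(rnorm(N * p), N, p))   # design matrix; first column = intercept
beta_true <- c(2, 0.5, -1, 3)
y <- drop(X %*% beta_true + rnorm(N))

# Closed-form least squares: solve (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Compare with R's built-in fit (lm adds its own intercept)
fit <- lm(y ~ X[, -1])
cbind(manual = beta_hat, lm = coef(fit))    # the two estimates agree
```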

25 Another geometrical interpretation of linear regression Least-squares regression with two predictors: the outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors $x_1$ and $x_2$; the projection $\hat{y}$ represents the vector of the least squares predictions. We minimize $RSS(\beta) = \|y - X\beta\|^2$ by choosing $\beta$ so that the residual vector is orthogonal to this subspace.

26 Example: Quantitative Structure-Activity Relationship We want to study the relationship between chemical structure and activity (solubility). Screen several compounds against a target in a biological assay. Measure quantitative features $x_j$ (molecular weight, electrical charge, surface area, # of atoms...). The response y is the activity (inhibition, solubility...): $y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip}$. Quantitative structure-activity relationship (QSAR) modeling. (Figure: aspirin as an example compound.)

27 Measuring Performance in Regression Models If the outcome is a number -> RMSE (a function of the model residuals): $$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N (y_i - \hat{y}_i)^2}$$ where $y_i$ is the real value and $\hat{y}_i$ the predicted value. Another measure is $R^2$ -> the proportion of information in the data which is explained by the model; more a measure of correlation
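
Both measures are easy to compute from the residuals; a minimal R sketch on simulated data (illustrative only):

```r
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
r_squared <- function(y, y_hat) 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)

set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

rmse(y, fitted(fit))        # root mean squared error of the fit
r_squared(y, fitted(fit))   # matches summary(fit)$r.squared for least-squares fits
```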

28 A short de-tour of the Predictive Modeling Process Always do a scatter plot of the response vs each feature to see if a linear relationship exists! Introduce some non-linearity into the model, e.g. a quadratic term, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2$, or fit a local linear regression
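
A sketch of what this looks like in R, assuming simulated data with a genuinely quadratic relationship:

```r
set.seed(3)
x1 <- runif(100, -2, 2)
y  <- 1 + 0.5 * x1 + 2 * x1^2 + rnorm(100, sd = 0.5)

plot(x1, y)                       # the scatter plot reveals the curvature

fit_lin  <- lm(y ~ x1)            # a straight line misses the structure
fit_quad <- lm(y ~ x1 + I(x1^2))  # adding the squared term captures it
anova(fit_lin, fit_quad)          # formal comparison of the two fits
```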

29 A short de-tour of the Predictive Modeling Process How the predictors enter the model is very important: 1. data transformation (centering / scaling, skewed data, outliers); 2. feature engineering / feature extraction (what are actually the informative features?)

30 A short de-tour of the Predictive Modeling Process Data transformation is necessary to avoid biases. Centering: subtract the mean of the data; scaling: divide by the standard deviation: $z = \frac{x - \bar{x}}{s}$. Skewness: $$skewness = \frac{\sum_i (x_i - \bar{x})^3}{(n-1)\,v^{3/2}}, \qquad v = \frac{\sum_i (x_i - \bar{x})^2}{n-1}$$ A skewness value of 20 indicates high skewness; a log transformation helps reduce the skewness
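
A minimal R sketch of these transformations (the exponential data are just an example of a right-skewed variable):

```r
set.seed(4)
x <- rexp(200, rate = 0.5)   # right-skewed example data

# Centering and scaling: z = (x - mean) / sd
z <- scale(x)                # scale() centers and scales by default

# Sample skewness as defined above
skewness <- function(x) {
  n <- length(x)
  v <- sum((x - mean(x))^2) / (n - 1)
  sum((x - mean(x))^3) / ((n - 1) * v^(3/2))
}
skewness(x)                  # clearly positive: right-skewed
skewness(log(x))             # the log transform reduces the skewness
```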

31 Between-Predictor Correlations Predictors can be correlated. If the correlation among predictors is high, then the ordinary least squares solution for linear regression will have high variability and will be unstable -> poor interpretation. (Figure: correlation heatmap for the structure-solubility data.) Collinearity: high correlation between pairs of variables
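
A quick way to inspect between-predictor correlations in R, on simulated collinear data (illustrative):

```r
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is almost collinear with x1
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

round(cor(X), 2)                # pairwise correlations; |r| near 1 flags collinearity
heatmap(cor(X), symm = TRUE)    # simple correlation heatmap with base R
```

The caret package mentioned earlier also provides findCorrelation(), which suggests which of a set of highly correlated predictors to drop.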

32 Data reduction and feature extraction We want a smaller set of predictors which captures most of the information in the data -> maybe predictors which are combinations of the original predictors? Principal Component Analysis (PCA) is a commonly used data reduction technique

33 A short de-tour of the Predictive Modeling Process Data reduction and feature extraction What about removing correlated predictors? Yes, possible, but there are cases where a predictor is correlated with a linear combination of other predictors... not detectable with pairwise correlation analysis. Other reasons to remove predictors: 1. zero-variance predictors (variables with few unique values); 2. the frequency of unique values is severely disproportionate

34 Goal: We want a technique (regression) which takes into account (solves) correlated variables... Regression + feature reduction

35 Principal Component Analysis (PCA) Idea: given data points (predictors) in d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible, e.g. find the best planar approximation to 3D data. Learns a lower-dimensional representation of the inputs and uncovers structure in the data. It generates a smaller set of predictors which captures the majority of the information in the original variables; the new predictors are functions of the original predictors

36 Example 1: study the motion of a spring The important dimension to describe the dynamics of the system is x, but we do not know that! Every time sample recorded by the cameras is a point (vector) in a D-dimensional space, D=6. From linear algebra: every vector in a D-dimensional space can be written as a linear combination of some basis. Is there another basis (a linear combination of the original basis) which better re-expresses the data?

37 Principal Component Analysis (PCA) The hope is that the new basis will filter out the noise and reveal the hidden structure of the data -> in our case it will determine x as the important direction... You may have noticed the use of the word linear: PCA makes the stringent but powerful assumption of linearity -> it restricts the set of potential bases

38 PCA formal definition PCA: orthogonal projection of the data into a lower dimensional space, such that the variance of the projected data is maximal

39 Variance and the goal Quantitatively, we assume that the directions with the largest variances in our data space contain the dynamics of interest, and hence the highest signal-to-noise ratio (SNR)

40 Principal Component analysis Geometrical interpretation: find the rotation of the basis (axes) such that the first axis lies in the direction of greatest variation. In the new system the predictors (PCs) are orthogonal

41 PCA - Redundancy When two predictors x1 and x2 are correlated (measure redundant information), this complicates disentangling the effect of x1 and x2 on the response. It seems that either one predictor or a linear combination of the predictors can be used here

42 PCA in words Find the linear combination of X (in the new basis) which has the maximum variation. How do we formally find these new directions (basis) $u_i$? Project the data on the new directions: $X^Tu$. Find $u_1$ such that $var(X^Tu_1)$ is maximized, subject to the condition $u_1^Tu_1 = 1$. Find $u_2$ such that $var(X^Tu_2)$ is maximized, subject to the conditions $u_2^Tu_2 = 1$ and $u_1^Tu_2 = 0$. Keep finding directions of greatest variation orthogonal to those already found. Ideally, if N is the dimensionality of the original data, we need only a few D < N directions to explain the variability in the data sufficiently well (see the sketch below)
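
These directions are the eigenvectors of the covariance matrix of the (centered) data; a minimal R sketch on simulated 2D data, checked against R's built-in prcomp():

```r
set.seed(6)
X  <- matrix(rnorm(200), 100, 2) %*% matrix(c(2, 1, 0, 0.3), 2, 2)  # correlated data
Xc <- scale(X, center = TRUE, scale = FALSE)                        # center first

# Directions of maximal variance = eigenvectors of the covariance matrix
eig <- eigen(cov(Xc))
eig$vectors    # u_1, u_2: orthonormal directions, sorted by variance
eig$values     # variance of the data along each direction

# Same result with R's built-in PCA (up to the sign of each direction)
pca <- prcomp(X)
pca$rotation
pca$sdev^2     # variances along the PCs = the eigenvalues above
```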

43 How many Principal Components? Use the eigenvalues, which represent the variance explained by each component. Choose the number of eigenvalues that amounts to the desired percentage of the variance (scree plot; see the sketch below)
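
A sketch of this choice in R, using the built-in USArrests data as an example:

```r
pca <- prcomp(USArrests, scale. = TRUE)   # re-scale before PCA (see practical hints)

var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained                             # proportion of variance per component
cumsum(var_explained)                     # keep enough PCs for, say, 90% of the variance

# Scree plot: variance explained vs component number
plot(var_explained, type = "b",
     xlab = "Principal component", ylab = "Proportion of variance explained")
```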

44 PCA example: image compression

45 Principal Component Analysis (PCA) PCs are surrogate features / variables, and therefore (linear) functions of the original variables, which better re-express the data. We can express the PCs as linear combinations of the original predictors; the first PC is the best linear combination, the one capturing most of the variance: $$PC_j = a_{j1}\,feature_1 + a_{j2}\,feature_2 + ... + a_{jp}\,feature_p$$ where p = # of predictors and $a_{j1}, a_{j2}, ..., a_{jp}$ are the component weights / loadings
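
In R the loadings $a_{j1}, ..., a_{jp}$ sit in the rotation matrix returned by prcomp(); a small sketch verifying that the PC scores really are this linear combination (using USArrests again):

```r
pca <- prcomp(USArrests, scale. = TRUE)
a1 <- pca$rotation[, 1]          # loadings a_11 ... a_1p of the first PC
a1

# The PC1 scores equal the loading-weighted combination of the scaled predictors
Z <- scale(USArrests)
max(abs(Z %*% a1 - pca$x[, 1]))  # ~ 0, up to floating-point error
```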

46 Summarizing... The cool thing is that we have created components (PCs) which are uncorrelated. Some predictive models prefer predictors which are uncorrelated in order to find a good solution; PCA creates new predictors with exactly these characteristics! To get an intuition of the data: if PCA captured most of the information in the data, then plotting e.g. PC1 vs PC2 can reveal clusters/structures in the data

47 PCA practical hints 1. PCA seeks directions of maximum variance, so it is sensitive to the scale of the data; it might give higher weights to variables on large scales. Good practice is to re-scale the data before doing PCA. 2. Skewness can also cause problems

48 Goal: We want a technique (regression) which takes into account (solves) correlated variables... Regression + feature reduction. But PCA is an unsupervised technique... so it is blind to the response

49 Principal Component Regression (PCR) Dimension reduction method: it works in two steps. 1. Find transformed predictors $Z_1, Z_2, ..., Z_M$ with M < p (# of original features): $Z_m = \sum_{j=1}^p a_{jm} X_j$. 2. Fit a least squares model to these new predictors: $y_i = \theta_0 + \sum_{m=1}^M \theta_m z_{im} + \epsilon_i$. The choice of $Z_1 ... Z_M$ and the selection of the $a_{jm}$ can be achieved in different ways; one way is Principal Component Regression (PCR), almost PLS... E.g. $Z_1 = a_{11}x_1 + a_{21}x_2$ is the first principal component in the case of two variables; the $a_{jm}$ are the scores or loadings (see the sketch below)
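
In practice both steps are automated; a minimal sketch with the pls package in R (assuming it is installed), using the gasoline data shipped with that package:

```r
library(pls)  # install.packages("pls") if needed

data(gasoline)   # NIR spectra (predictors) and octane number (response)
fit_pcr <- pcr(octane ~ NIR, data = gasoline, ncomp = 10,
               scale = TRUE, validation = "CV")

summary(fit_pcr)                             # CV error as a function of # components
validationplot(fit_pcr, val.type = "RMSEP")  # pick M where the CV error flattens
```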

50 Drawback of PCR We assume that the directions in which the $x_i$ show the most variation are the directions associated with the response y... If this assumption holds, then an appropriate choice of M = # of components will give better results. But this assumption is not always fulfilled, and when $Z_1 ... Z_M$ are produced in an unsupervised way there is no guarantee that these directions (which best explain the input) are also the best to explain the output. When will PCR perform worse than ordinary least squares regression?

51 Partial Least Squares Regression (PLSR) Supervised alternative to PCR. It makes use of the response Y to identify the new features: it attempts to find directions that help explain both the response and the predictors

52 PLS Algorithm 1. Compute the first partial least squares direction $Z_1 = \sum_{j=1}^p a_{j1} X_j$ by setting each $a_{j1}$ to the coefficient from the simple linear regression of Y onto $X_j$ (in general, $Z_m = \sum_{j=1}^p a_{jm} X_j$). 2. Different interpretation of the loadings $a_{jm}$: here, they measure how important the predictor is for the response! 3. Then Y is regressed on $Z_1$, giving $\theta_1$. 4. To find $Z_2$ we adjust all variables for $Z_1$: we project (regress) each of them onto $Z_1$, $\hat{X}_j = \gamma_j Z_1$. 5. Compute the residuals $X_j - \gamma_j Z_1$ (the remaining information which has not been explained by the first PLS direction). 6. Compute $Z_2$ (and each further $Z_m$) in the same way, using the residualized data. 7. The iterative approach can be repeated M times to identify multiple PLS components (see the sketch below)
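
A sketch of step 1 in R on simulated data (illustrative; a1 plays the role of the $a_{j1}$ above), together with a full fit from the pls package for comparison:

```r
library(pls)

set.seed(7)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(3, 0, 0, 1, 0)) + rnorm(n)

# Step 1: a_j1 = coefficient of the simple regression of y onto X_j
a1 <- apply(X, 2, function(xj) coef(lm(y ~ xj))[2])
z1 <- X %*% a1                       # first PLS direction Z_1
theta1 <- coef(lm(y ~ z1))[2]        # step 3: regress Y on Z_1

# Full PLS fit for comparison
fit_pls <- plsr(y ~ X, ncomp = 3, validation = "CV")
summary(fit_pls)
```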

53 Example from the QSAR modeling problem - PCR (Figure: scatter plot of two predictors with the direction of the first PC.) The first PC direction contains no predictive information about the response

54 Example from the QSAR modeling problem - PLS (Figure: the PLS direction on two predictors.) The PLS direction contains highly predictive information about the response

55 Example from the QSAR modeling problem PCR & PLS (Figure: comparison of PLS and PCR.)

56 Summary Dimension reduction (PCA). Regression problem: linear regression (least squares). PCR and PLS are methods for feature reduction and de-correlation of the features: they reduce over-fitting and improve accuracy, but can be hard to interpret
