Final project STK4030/9030 - Statistical learning: Advanced regression and classification, Fall 2015

Available: Monday November 2nd. Hand-in deadline: Monday November 23rd at 13.00.

November 16, 2015

This is the problem set for the project part of the finals in STK4030, fall 2015. The reports shall be written individually. Three copies marked with your candidate number shall be placed in Kristoffer Hellton's post box at room B700 on the seventh floor of Niels Henrik Abels hus. Handwritten reports are acceptable. Enclose the parts of the computer output that are necessary for answering the questions; the remaining parts can be collected in appendices. When you refer to material in the appendices, indicate explicitly where it can be found.

Importantly, each student needs to submit an extra page with her or his report. This page is the self-declaration form, properly signed, with the appropriate course, STK4030 (master level) or STK9030 (PhD level), clearly marked; it is available at the webpage as "Exam Project, declaration form". All data sets and R scripts are available at the course webpage.

Kristoffer H. Hellton

1 Problem 1 - Vino verde wine quality

A Portuguese wine seller wants your help in constructing a predictor for wine taste preferences. He has collected easily available physicochemical data and a quality assessment (between 0 and 10) by human wine experts for 600 Vino verde white wine samples. The data set wine.rdata contains the response quality and 11 covariates¹ (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol). The data set is divided into a training and a test set of 300 samples each, and the last column, test, indicates whether a sample is in the test set. Before the analysis, center the mean and scale the variance of the covariates in the whole data set using all 600 samples.

¹ Cortez et al. (2009), Modeling wine preferences by data mining from physicochemical properties.
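
A minimal R sketch of this set-up could look as follows; the object name wine and the column names quality and test are assumptions about what wine.rdata contains:

    # Set-up sketch: load the data and standardize the covariates using all 600 samples
    load("wine.rdata")
    covars <- setdiff(names(wine), c("quality", "test"))
    wine[, covars] <- scale(wine[, covars])       # centre and scale each covariate

    test_set <- as.logical(wine$test)             # TRUE for the 300 test samples
    x_train <- as.matrix(wine[!test_set, covars])
    y_train <- wine$quality[!test_set]
    x_test  <- as.matrix(wine[test_set, covars])
    y_test  <- wine$quality[test_set]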

(a) Estimate a linear regression model with quality as the response and all 11 predictors by (ordinary) least squares (OLS), and predict the quality of the 300 test samples. Which covariates have the strongest association with the wine quality?

(b) Estimate the same linear regression model by ridge regression, where the optimal tuning parameter λ is chosen by 10-fold cross-validation, and predict the quality of the 300 test samples. Plot the cross-validated mean squared error against the base-10 logarithm of the λ values. What is the optimal value of λ?

(c) Estimate the same linear regression model by the lasso, where the optimal tuning parameter λ is chosen by 10-fold cross-validation, and predict the quality of the 300 test samples. Plot the cross-validated mean squared error against the base-10 logarithm of the λ values. What is the optimal value of λ?

(d) Use the test data set for validation and compare OLS, ridge and lasso in terms of the mean squared error on the test data. Which predictor would you recommend to the wine seller? How will the estimated mean squared error on the test data for your chosen predictor compare to the true test error?
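
A minimal sketch for (a)-(d) using lm and the glmnet package, assuming the standardized objects x_train, y_train, x_test and y_test from the sketch above:

    library(glmnet)

    # (a) ordinary least squares
    ols <- lm(y_train ~ x_train)
    summary(ols)                                  # which covariates are most strongly associated
    pred_ols <- cbind(1, x_test) %*% coef(ols)

    # (b) ridge regression, lambda chosen by 10-fold cross-validation
    cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
    plot(log10(cv_ridge$lambda), cv_ridge$cvm, xlab = "log10(lambda)", ylab = "CV MSE")
    pred_ridge <- predict(cv_ridge, newx = x_test, s = "lambda.min")

    # (c) lasso, lambda chosen by 10-fold cross-validation
    cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
    plot(log10(cv_lasso$lambda), cv_lasso$cvm, xlab = "log10(lambda)", ylab = "CV MSE")
    pred_lasso <- predict(cv_lasso, newx = x_test, s = "lambda.min")

    # (d) mean squared errors on the test set
    mean((y_test - pred_ols)^2)
    mean((y_test - pred_ridge)^2)
    mean((y_test - pred_lasso)^2)

Note that plot() on a cv.glmnet object uses the natural logarithm of λ; the manual plots above use base 10 as the problem asks.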

2 Problem 2 - Fossils

Bralower et al. (1997), Mid-Cretaceous strontium-isotope stratigraphy of deep-sea sections, investigated how fossils of different ages have varying ratios of strontium isotopes. The characterization of the strontium levels is important for understanding the production rate of the oceanic crust. The change of the ratio between strontium isotopes over time is highly non-linear.

The data set fossil.rdata contains the ratio between Strontium-87 and Strontium-86 and the age (in millions of years) for 106 fossil samples. You will estimate a non-linear semi-parametric regression model with age as input and strontium.ratio as response.

(a) Construct a B-spline basis of order 4 for the input, with external boundary knots at the range of age and 40 internal knots located

i) equidistantly, with a distance of 0.761 apart,
ii) at equidistant quantiles of the distribution of age.

Plot both sets of basis functions.

(b) Use the B-splines with quantile-based knots as a spline basis and estimate the smoothing spline regression curve for the relationship between age and strontium.ratio. Select the smoothing parameter both by leave-one-out cross-validation and by the generalized cross-validation (GCV) criterion. Report the estimated smoothing parameters. Plot the data and both smoothing spline curves. What do you predict the strontium ratio to be for a 113.5 million year old sample?
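
A possible sketch of the two bases in (a) with the splines package; the data frame name fossil and its column names age and strontium.ratio are assumptions about fossil.rdata:

    library(splines)
    load("fossil.rdata")                          # assumed to provide 'fossil' with 'age' and 'strontium.ratio'
    age <- fossil$age

    # order 4 (cubic, degree 3) B-splines; 40 interior knots plus boundary knots at range(age)
    knots_equi  <- seq(min(age), max(age), length.out = 42)[2:41]   # equidistant, about 0.761 apart
    knots_quant <- quantile(age, probs = seq(0, 1, length.out = 42)[2:41])

    B_equi  <- bs(age, knots = knots_equi,  degree = 3, Boundary.knots = range(age), intercept = TRUE)
    B_quant <- bs(age, knots = knots_quant, degree = 3, Boundary.knots = range(age), intercept = TRUE)

    # plot the basis functions on a fine grid
    grid <- seq(min(age), max(age), length.out = 500)
    matplot(grid, predict(B_quant, grid), type = "l", lty = 1, xlab = "age", ylab = "basis functions")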

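For (b), stats::smooth.spline provides both criteria directly (cv = TRUE gives ordinary leave-one-out cross-validation, cv = FALSE gives GCV), although it places its own knots rather than the quantile-based B-spline knots from (a), so this sketch is only an approximation of what the problem asks for:

    fit_loocv <- smooth.spline(fossil$age, fossil$strontium.ratio, cv = TRUE)
    fit_gcv   <- smooth.spline(fossil$age, fossil$strontium.ratio, cv = FALSE)
    c(fit_loocv$spar, fit_gcv$spar)               # estimated smoothing parameters
    c(fit_loocv$lambda, fit_gcv$lambda)

    plot(fossil$age, fossil$strontium.ratio, xlab = "age", ylab = "strontium.ratio")
    lines(fit_loocv, col = "blue")
    lines(fit_gcv, col = "red")

    predict(fit_loocv, x = 113.5)$y               # predicted ratio for a 113.5 Myr old sample

If age contains tied values, smooth.spline will warn that leave-one-out cross-validation with non-unique x values is doubtful, but the computation still runs.
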
The portraits are a mix of images taken at the shoulders and at the chin, such that the relative size of the face varies significantly. The variable shoulder indicates whether a portrait includes the shoulders or not. (b) Find the principal components of the whole data set and display the first three eigenvectors (the principal component directions) as 100 100 images. The images of the eigenvector are usually referred to as eigenfaces in context of face recognition. What do the first three components represent in terms of facial or image features? How many principal components are needed to express 80% of the variation in the data set? (c) Find the first three principal components (n 1 vectors) of the whole data set. 2 Plot the first and second principal components and the first and third principal components against each other and color the observations according to gender. Then color the plots according to the shoulders being present or not. What do you conclude? Display all four figures. (d) Recode gender as an indicator variable { 1, if G i = male, y i = 1, if G i = female. Construct a classifier for gender using principal component regression (PCR) with y i as the response, and classify an observation as female if ˆf(x i ) > 0. Use leave-one-out cross-validation to select the number of components, m < 200, and select the smallest m if the crossvalidation error has several minima. What is the misclassification rate (the error with a 1-0 loss function on the training data)? (e) Construct a classifier for gender using partial least square (PLS) with y i as the response, and classify an observation as female if ˆf(x i ) > 0. Use leave-one-out crossvalidation to select the number of PLS directions, m < 200, and select the smallest m if the crossvalidation error has several minima. What is the misclassification rate (the error with a 1-0 loss function on the training data) now? Why does the optimal m for PCR and PLS differ? Ordinary linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) cannot be used in a high-dimensional setting with p > n, as the sample covariance matrix will be singular. You will now explore two ways to avoid this issue; reduce the dimension and regularizing the covariance matrix. (f) Let the whole data set (10000 200) to be represented by the first five principal components (5 200). Use the reduced data as input for QDA and construct a classifier for gender. What is misclassification rate on the training data? Plot the quadratic decision boundaries in the figures from 3(c). 2 The first three columns of XV = UD in the notation of the singular value decomposition. 4

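For (d)-(f), one possible route is the pls package for PCR and PLS and MASS::qda for the quadratic classifier; X is again t(faces), and the maximum number of components is truncated to 100 here only to keep the leave-one-out runs manageable (the problem allows m < 200):

    library(pls)
    library(MASS)
    X      <- t(faces)
    y      <- rep(c(-1, 1), each = 100)          # -1 = male, 1 = female
    gender <- factor(rep(c("male", "female"), each = 100))

    pcr_fit <- pcr(y ~ X, ncomp = 100, validation = "LOO")
    pls_fit <- plsr(y ~ X, ncomp = 100, validation = "LOO")

    # smallest m minimising the cross-validated RMSEP (drop the 0-component entry)
    m_pcr <- which.min(RMSEP(pcr_fit, estimate = "CV")$val[1, 1, -1])
    m_pls <- which.min(RMSEP(pls_fit, estimate = "CV")$val[1, 1, -1])

    # training misclassification rates with the 0-1 loss
    mean(ifelse(predict(pcr_fit, ncomp = m_pcr) > 0, 1, -1) != y)
    mean(ifelse(predict(pls_fit, ncomp = m_pls) > 0, 1, -1) != y)

    # (f) QDA on the first five principal component scores
    pc5 <- prcomp(X)$x[, 1:5]
    qda_fit <- qda(pc5, grouping = gender)
    mean(predict(qda_fit, pc5)$class != gender)
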
(g) Now consider the gender and shoulder categories together as four different categories (maleshoulder, malenoshoulder, femaleshoulder and femalenoshoulder). Use the first five principal components (5 × 200) to construct a QDA classifier for the four gender-shoulder categories and classify the training data. Merge the classifications for the groups with and without shoulders within each gender, so that you end up with a classification for gender. Compare the result to the classification in (f). Does the misclassification rate decrease when the shoulders are taken into account?

(h) Finally, return to the whole data set and use a version of regularized LDA to classify gender only. Motivate how you regularize the sample covariance matrix and argue for your choice of tuning (penalty) parameter. What is the misclassification rate?
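
One possible regularization for (h) is a ridge-type shrinkage of the pooled within-class covariance, S_λ = S + λI. The sketch below only illustrates that particular choice (including the value of λ, which would have to be tuned); it uses the Woodbury identity so that the 10000 × 10000 matrix is never formed:

    X      <- t(faces)                           # 200 x 10000
    male   <- 1:100; female <- 101:200
    mu_m   <- colMeans(X[male, ]); mu_f <- colMeans(X[female, ])
    Xc     <- rbind(sweep(X[male, ], 2, mu_m), sweep(X[female, ], 2, mu_f))
    A      <- Xc / sqrt(nrow(X) - 2)             # pooled covariance S = t(A) %*% A
    lambda <- 1e3                                # illustrative value only; to be tuned

    # solve (S + lambda I) z = v via the Woodbury identity (only a 200 x 200 system)
    ridge_solve <- function(v) {
      w <- solve(A %*% t(A) + lambda * diag(nrow(A)), A %*% v)
      (v - t(A) %*% w) / lambda
    }

    beta  <- ridge_solve(mu_f - mu_m)            # regularized discriminant direction
    score <- sweep(X, 2, (mu_m + mu_f) / 2) %*% beta
    pred  <- ifelse(score > 0, "female", "male") # equal class sizes, so the priors cancel
    mean(pred != rep(c("male", "female"), each = 100))   # training misclassification rate

Other regularizations, such as shrinking S towards a diagonal matrix or a convex combination of S and a multiple of the identity, are equally legitimate; the point of the exercise is to motivate the choice and how the penalty parameter is tuned.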