Homework 6. Due: 10am Thursday 11/30/17


1. Hinge loss vs. logistic loss. In class we defined the hinge loss

$$\ell_{\text{hinge}}(x, y; w) = (1 - y w^T x)_+$$

and the logistic loss

$$\ell_{\text{logistic}}(x, y; w) = \log(1 + \exp(-y w^T x)).$$

Suppose we want to minimize the regularized empirical risk

$$\min_w \;\; \frac{1}{n} \sum_{i=1}^n \ell(x_i, y_i; w) + \lambda \|w\|_2^2,$$

where $\lambda = 1$. In this problem, we will see how each of these loss functions performs on a binary classification problem: predicting whether a breast tumor is benign or malignant based on its features. The dataset, breast-cancer.csv, can be found at

https://github.com/orie4741/homework/breast-cancer.csv

The dataset consists of 683 data points. The first column is the class ($-1$: benign, $1$: malignant), and the following 9 columns are the features.

(a) In class, we defined the subdifferential $\partial f$ of a function $f : \mathbb{R} \to \mathbb{R}$, which generalizes the gradient to non-differentiable losses. It maps points to sets. It is easiest to compute using the following definition:

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

If $f$ is not differentiable at $x$, let $g_+ = \lim_{\epsilon \downarrow 0} \nabla f(x + \epsilon)$ and $g_- = \lim_{\epsilon \downarrow 0} \nabla f(x - \epsilon)$. Then $\partial f(x)$ is the set of all convex combinations of these one-sided gradients:

$$\partial f(x) = \{\alpha g_+ + (1 - \alpha) g_- : \alpha \in [0, 1]\}.$$

For example, for $f(x) = |x|$ at $x = 0$, we have $g_+ = 1$ and $g_- = -1$, so $\partial f(0) = [-1, 1]$.

Write the subdifferentials of the hinge loss and the logistic loss, respectively. Feel free to give a piecewise definition.
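To make the objective concrete, here is a minimal Julia sketch of the two losses and the regularized empirical risk. The names (hinge_loss, logistic_loss, reg_risk) are illustrative, not from the course code.

```julia
using LinearAlgebra

# Illustrative definitions; the names here are mine, not from the course code.
hinge_loss(x, y, w)    = max(0, 1 - y * dot(w, x))
logistic_loss(x, y, w) = log(1 + exp(-y * dot(w, x)))

# Regularized empirical risk (1/n) Σᵢ ℓ(xᵢ, yᵢ; w) + λ‖w‖₂²,
# where the rows of X are data points and y holds ±1 labels.
function reg_risk(loss, X, y, w; λ = 1.0)
    n = size(X, 1)
    return sum(loss(X[i, :], y[i], w) for i in 1:n) / n + λ * norm(w)^2
end
```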

(b) The proximal subgradient method works exactly like the proximal gradient method, except that we choose an (arbitrary) element of the subdifferential of the loss function instead of its gradient. Write pseudocode for the proximal subgradient method applied to the problem above with hinge loss and with logistic loss, respectively.

(c) Split the data set randomly into a training set (50%) and a test set (50%). Run your proximal subgradient method on the training set to find minimizers $w_{\text{hinge}}$ and $w_{\text{logistic}}$.

(d) Recall that the misclassification rate is defined as

$$\frac{1}{n} \sum_{i=1}^n \mathbf{1}(\hat y_i \neq y_i),$$

where $\hat y_i$ is your prediction for test data point $i$, and $\mathbf{1}(\hat y_i \neq y_i)$ is 1 when $\hat y_i \neq y_i$ and 0 otherwise. Report the misclassification rates of $w_{\text{hinge}}$ and $w_{\text{logistic}}$ on the test set. Which model performs better?

Hint. You may find the Julia function readtable useful for reading the data. To run the proximal gradient method, you may use the proxgrad function posted at

https://github.com/orie4741/demos/proxgrad.jl

You can include this file in your code by making sure it is in the same directory Julia is running from and calling include("proxgrad.jl").
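If you prefer to see the idea in code before writing your own pseudocode, here is a minimal sketch of a proximal subgradient loop for the hinge-loss version of the problem, with a fixed step size α. This is my own illustration; the interface and details of the course's proxgrad.jl may differ.

```julia
using LinearAlgebra

# Subgradient of the averaged hinge loss at w: each data point with a
# margin violation y_i * wᵀx_i < 1 contributes -y_i * x_i; at the kink
# (margin exactly 1), 0 is a valid subgradient choice.
function hinge_subgrad(X, y, w)
    g = zeros(length(w))
    for i in 1:size(X, 1)
        if y[i] * dot(X[i, :], w) < 1
            g -= y[i] * X[i, :]
        end
    end
    return g / size(X, 1)
end

# Proximal operator of w ↦ αλ‖w‖₂², which simply shrinks w toward zero.
prox_l2(v, α, λ) = v / (1 + 2 * α * λ)

# Proximal subgradient method: a subgradient step on the loss followed
# by a prox step on the regularizer.
function prox_subgrad(X, y; λ = 1.0, α = 0.01, iters = 1000)
    w = zeros(size(X, 2))
    for _ in 1:iters
        w = prox_l2(w - α * hinge_subgrad(X, y, w), α, λ)
    end
    return w
end
```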

and log(p hinge (x i, y i ; w hinge )) using the test data set and report the log likelihood. Which one is larger? 2. Multiclass classification and ordinal regression. In this problem, we will study some important properties of loss functions for multiclass classification and ordinal regression. (a) In class we have defined the multinomial logit function as follows. Let W R k d x R d, so W x R k. Define P(y = i z) = exp (z i ) k j=1 exp (z j), where z = W x. (See page 37 of the loss function slides for details.) Define the imputed region for class i as A i = {x : P(y = i W x) P(y = j W x), j Y}. Explain what the imputed region represents, and show that each imputed region A i is convex. As a reminder, a set S is convex if for any x S, y S, and 0 λ 1, λx + (1 λ)y S. (b) One-vs-all classification. In the one-vs-all classification scheme, we define a loss function as k l(y, z) = l bin (ψ(y) i, z i ), where ψ(y) = ( 1,..., yth entry {}}{ 1,..., 1) { 1, 1} k. Here we will use logistic loss as our binary loss function l bin (ψ i, z i ) = l logistic (ψ i, z i ) = log(1 + exp ( ψ i z i )). (See the loss function slides on multiclass classification for details.) Prove the following inequality and explain what it means: l(i, ψ(i)) l(j, ψ(i)), i, j Y. 3

(c) Ordinal regression. One method for ordinal regression is to define the loss function

$$\ell(y, z) = \sum_{i=1}^{k-1} \ell_{\text{bin}}(\psi(y)_i, z_i), \qquad \text{where} \qquad \psi(y) = (\mathbf{1}(y > 1), \mathbf{1}(y > 2), \ldots, \mathbf{1}(y > k-1)) \in \mathbb{R}^{k-1}.$$

Again, we will use logistic loss as our binary loss function $\ell_{\text{bin}}$. (See page 42 of the loss function slides for details.) Prove that the following inequalities hold, and explain what they mean:

$$\ell(i, \psi(i)) \leq \ell(j, \psi(i)) \quad \forall i, j \in \mathcal{Y},$$
$$\ell(i + 1, \psi(i)) \leq \ell(i + 2, \psi(i)) \quad \forall i \text{ with } i + 2 \in \mathcal{Y}.$$

3. Grading by matrix completion. There are $m$ ORIE 4741 project groups and $n$ students, and each student is responsible for grading several projects. Each project has an underlying quality; some are good, some less good. Some students are fair graders and report the project quality as their grade. Some are easy graders and report a higher grade. Some are harsh graders and report a lower grade. We collect the grades into a grade matrix $A \in \mathbb{R}^{m \times n}$: $A_{ij}$ represents the grade that student $j$ would assign to project $i$. Of course, we cannot assign each student to grade every project. Instead, we make peer review assignments $\Omega = \{(i_1, j_1), \ldots\}$: $(i, j) \in \Omega$ if student $j$ is assigned to grade project $i$. Suppose each project is graded by $p$ peers. Unfortunately, this means that some projects are assigned harsher graders than others. Our goal is to find a fair way to compute a project's final grade. We consider two methods; a code sketch of both follows below.

(a) Averaging. The grade for project $i$ is the average of the grades given by its peer reviewers:

$$g_i^{\text{avg}} = \frac{1}{p} \sum_{j : (i,j) \in \Omega} A_{ij}.$$

(b) Matrix completion. We fit a low rank model to the grade matrix and use it to compute an estimate $\hat A \in \mathbb{R}^{m \times n}$ of the grade matrix. To be concrete, suppose we find $\hat A$ by fitting a rank-1 model. We use Huber loss, for robustness against outlier grades, and nonnegative regularization, since both student grading toughness and project quality are nonnegative:

$$\text{minimize} \quad \sum_{(i,j) \in \Omega} \text{huber}(A_{ij} - x_i y_j) + \mathbf{1}(x \geq 0) + \mathbf{1}(y \geq 0),$$

where $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^n$. We compute our estimate as $\hat A = x y^T$. In other words, $\hat A$ is the rank-1 matrix that best matches the observations in the sense of Huber error. We then compute the grade for project $i$ as the average of the estimated grades:

$$g_i^{\text{mc}} = \frac{1}{n} \sum_{j=1}^n \hat A_{ij}.$$

In this problem, we will consider which of these two grading schemes, averaging or matrix completion, is better. Code for this problem can be found at

https://github.com/orie4741/homework/blob/master/grading_by_matrix_completion.ipynb
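Here is the sketch of the two grading schemes referenced above, assuming $\Omega$ is stored as a vector of (i, j) tuples; the function names are mine, not from the course notebook.

```julia
# (a) Average the observed grades for each project, assuming Ω is a
# vector of (i, j) tuples and each project has exactly p graders.
function average_grades(A, Ω, m, p)
    g = zeros(m)
    for (i, j) in Ω
        g[i] += A[i, j] / p
    end
    return g
end

# (b) Row means of the completed matrix Â = x*y' (Julia 1.x syntax).
mc_grades(Â) = vec(sum(Â, dims = 2)) ./ size(Â, 2)
```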

(a) Analytical problem. Consider $m = 2$ project groups and $n = 4$ peer graders. Suppose group 1 did well on their project and deserves a grade of 6, whereas group 2 deserves a grade of 3. Graders 1 and 2 are easy graders, and graders 3 and 4 are harsh. Each project is graded by three graders. The grades given are

$$X = \begin{bmatrix} 8 & * & 4 & 4 \\ 4 & 4 & * & 2 \end{bmatrix}.$$

Here, an $*$ in the $(i,j)$th entry means the $j$th student was not responsible for grading the $i$th project. Use both methods, averaging and matrix completion, to compute grades for the two groups. (You should be able to compute the results of both methods by hand, on paper.) Explain how you computed $\hat A$. Compare your results. Which grading method would you say is more fair?

(b) A more realistic example. Let's generate a more realistic example of a grade matrix and observation matrix. The code in the Jupyter notebook constructs a rank-1 grade matrix with 40 rows and 120 columns, with true project quality scores ranging from 3 to 8 and student easiness indices (the ratio of the given score to the true score) ranging from 0.5 to 1.5. Each group is graded by 6 graders. Describe in words the structure of the true grades matrix generated by this code. What rank does it have?
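The notebook code is not reproduced here, but a generator matching the description above might look like the following sketch. It assumes uniformly random qualities and easiness indices and uses Julia 1.x syntax; the actual notebook code may differ.

```julia
using Random

m, n, p = 40, 120, 6              # projects, students, graders per project
quality  = 3 .+ 5 .* rand(m)      # true project scores in [3, 8]
easiness = 0.5 .+ rand(n)         # given-to-true score ratios in [0.5, 1.5]
A_true = quality * easiness'      # rank-1 true grade matrix, 40 × 120

# Assign p graders uniformly at random to each project.
Ω = [(i, j) for i in 1:m for j in randperm(n)[1:p]]
```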

(c) Fit a low rank model. Using the LowRankModels package, fit a rank-1 model to this grade matrix with Huber loss and a nonnegative regularizer. Use your model to compute an estimated grade matrix $\hat A$. (A starting sketch appears at the end of this problem.)

(d) Grade the projects. Compute final grades for all 40 projects using both the averaging and matrix completion methods. Compare the results. Which method would you say is more fair?

(e) (Extra credit) Distributions. Try some other distributions for grades by changing the way we generate the data, or by changing how students are assigned to grade projects. Do the results change?

(f) (Extra credit) Low rank models. Try some other matrix completion models: use different loss functions, regularizers, or ranks; initialize the models with different tricks; or change the parameters of the optimization algorithm used by LowRankModels. Which work better, and which work less well? Why do you think that is?
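As a starting point for parts (c) and (d), here is a hedged sketch using LowRankModels. It assumes A holds the observed grades and Ω the assignments from the notebook; the constructor call follows the package README, but the API has changed across releases, so check the documentation of your installed version.

```julia
using LowRankModels

# Rank-1 GLRM with Huber loss on the observed entries and nonnegativity
# constraints on both factors, matching the objective in problem 3(b).
k = 1
glrm = GLRM(A, HuberLoss(), NonNegConstraint(), NonNegConstraint(), k; obs = Ω)
X, Y, ch = fit!(glrm)

# LowRankModels stores the factors so that A ≈ X'Y.
Â = X' * Y

# Part (d): final grades under both schemes, reusing the earlier sketches.
g_avg = average_grades(A, Ω, size(A, 1), p)
g_mc  = mc_grades(Â)
```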