Homework 6
Due: 10am Thursday 11/30/17

1. Hinge loss vs. logistic loss. In class we defined hinge loss

       $\ell_{\text{hinge}}(x, y; w) = (1 - y w^T x)_+$

   and logistic loss

       $\ell_{\text{logistic}}(x, y; w) = \log(1 + \exp(-y w^T x))$.

   Suppose we want to minimize the regularized empirical risk

       $\min_w \; \frac{1}{n} \sum_{i=1}^n \ell(x_i, y_i; w) + \lambda \|w\|_2^2$,

   where $\lambda = 1$. In this problem, we see how each of these loss functions performs on a binary classification problem: predicting whether a breast tumor is benign or malignant based on its features. The dataset, breast-cancer.csv, can be found at

       https://github.com/orie4741/homework/breast-cancer.csv

   The dataset consists of 683 data points. The first column is the class ($-1$: benign, $1$: malignant), and the following 9 columns are the features.

   (a) In class, we defined the subgradient $\partial f$ of a function $f : \mathbf{R} \to \mathbf{R}$, which generalizes the gradient to non-differentiable losses. It maps points to sets. It is easiest to compute using the following definition:
       - If $f$ is differentiable at $x$, then $\partial f(x) = \{f'(x)\}$.
       - If $f$ is not differentiable at $x$, let $g_+ = \lim_{\epsilon \downarrow 0} f'(x + \epsilon)$ and $g_- = \lim_{\epsilon \downarrow 0} f'(x - \epsilon)$. Then $\partial f(x)$ is the set of all convex combinations of these one-sided derivatives:

           $\partial f(x) = \{\alpha g_+ + (1 - \alpha) g_- : \alpha \in [0, 1]\}$.
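To see how this definition plays out on a simple non-differentiable function (an illustrative Python sketch, not part of the assignment), consider $f(x) = |x|$, which is differentiable everywhere except at 0:

```python
def subgradient_abs(x):
    """Subdifferential of f(x) = |x|, returned as an interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)    # differentiable: f'(x) = 1
    if x < 0:
        return (-1.0, -1.0)  # differentiable: f'(x) = -1
    # not differentiable at 0: g_+ = 1, g_- = -1, so the subdifferential
    # is every convex combination of the two, i.e. the interval [-1, 1]
    return (-1.0, 1.0)
```

Hinge loss behaves the same way: it is differentiable everywhere except at the kink $y w^T x = 1$, where the subdifferential is an interval of slopes.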
       Write the subgradient of hinge loss and logistic loss, respectively. Feel free to give a piecewise definition.

   (b) The proximal subgradient method works exactly like the proximal gradient method, except that we choose an (arbitrary) element of the subgradient of the loss function instead of the gradient of the loss function. Write pseudocode for the proximal subgradient method applied to the problem above with hinge loss and logistic loss, respectively.

   (c) Split the data set randomly into a training set (50%) and a test set (50%). Run your proximal subgradient method on the training set to find minimizers $w_{\text{hinge}}$ and $w_{\text{logistic}}$.

   (d) Remember the misclassification rate is defined as

           $\frac{1}{n} \sum_{i=1}^n \mathbf{1}(\hat{y}_i \neq y_i)$,

       where $\hat{y}_i$ is your prediction for test data point $i$, and $\mathbf{1}(\hat{y}_i \neq y_i)$ is 1 when $\hat{y}_i \neq y_i$ and 0 otherwise. Report the misclassification rates of $w_{\text{hinge}}$ and $w_{\text{logistic}}$ on the test set. Which model performs better?

       Hint. You may find the Julia function readtable useful to read the data. To run the proximal gradient method, you may use the proxgrad function posted at

           https://github.com/orie4741/demos/proxgrad.jl

       You can include this file in your code by making sure the file is in the same directory that Julia is running from, and calling include("proxgrad.jl").

   (e) Logistic loss can be interpreted as the negative log likelihood of $y$ given $w^T x$, so

           $\ell_{\text{logistic}}(x, y; w) = -\log P_{\text{logistic}}(x, y; w)$,
           $\exp(-\ell_{\text{logistic}}(x, y; w)) = P_{\text{logistic}}(x, y; w)$.

       Similarly, we can give hinge loss a probabilistic interpretation:

           $\frac{1}{z(x; w)} \exp(-\ell_{\text{hinge}}(x, y; w)) = P_{\text{hinge}}(x, y; w)$,

       where

           $z(x; w) = \exp(-\ell_{\text{hinge}}(x, 1; w)) + \exp(-\ell_{\text{hinge}}(x, -1; w))$

       is the normalizing constant. Why is there no normalizing constant for logistic loss?

   (f) Compute the log likelihoods of these two models,

           $\sum_{i=1}^n \log(P_{\text{logistic}}(x_i, y_i; w_{\text{logistic}}))$
       and

           $\sum_{i=1}^n \log(P_{\text{hinge}}(x_i, y_i; w_{\text{hinge}}))$,

       using the test data set and report the log likelihood. Which one is larger?

2. Multiclass classification and ordinal regression. In this problem, we will study some important properties of loss functions for multiclass classification and ordinal regression.

   (a) In class we have defined the multinomial logit function as follows. Let $W \in \mathbf{R}^{k \times d}$ and $x \in \mathbf{R}^d$, so $W x \in \mathbf{R}^k$. Define

           $P(y = i \mid z) = \frac{\exp(z_i)}{\sum_{j=1}^k \exp(z_j)}$,

       where $z = W x$. (See page 37 of the loss function slides for details.) Define the imputed region for class $i$ as

           $A_i = \{x : P(y = i \mid W x) \geq P(y = j \mid W x), \; \forall j \in \mathcal{Y}\}$.

       Explain what the imputed region represents, and show that each imputed region $A_i$ is convex. As a reminder, a set $S$ is convex if for any $x \in S$, $y \in S$, and $0 \leq \lambda \leq 1$, $\lambda x + (1 - \lambda) y \in S$.

   (b) One-vs-all classification. In the one-vs-all classification scheme, we define a loss function as

           $\ell(y, z) = \sum_{i=1}^k \ell_{\text{bin}}(\psi(y)_i, z_i)$,

       where

           $\psi(y) = (-1, \ldots, \underbrace{1}_{y\text{th entry}}, \ldots, -1) \in \{-1, 1\}^k$.

       Here we will use logistic loss as our binary loss function:

           $\ell_{\text{bin}}(\psi_i, z_i) = \ell_{\text{logistic}}(\psi_i, z_i) = \log(1 + \exp(-\psi_i z_i))$.

       (See the loss function slides on multiclass classification for details.) Prove the following inequality and explain what it means:

           $\ell(i, \psi(i)) \leq \ell(j, \psi(i)), \quad \forall i, j \in \mathcal{Y}$.
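As a numeric sanity check of the definitions in 2(a) and 2(b) (a Python sketch with made-up scores, and classes indexed from 0; the course code is in Julia), the multinomial logit produces a probability vector, and for the one-vs-all loss evaluated at $z = \psi(i)$ the smallest loss is indeed attained at $j = i$:

```python
import numpy as np

def multinomial_logit(z):
    """P(y = i | z) = exp(z_i) / sum_j exp(z_j), computed stably."""
    z = z - z.max()   # shifting z changes nothing: the shift cancels in the ratio
    p = np.exp(z)
    return p / p.sum()

def psi(y, k):
    """One-vs-all encoding: -1 everywhere except +1 in the y-th entry."""
    v = -np.ones(k)
    v[y] = 1.0
    return v

def ova_loss(y, z):
    """l(y, z) = sum_i log(1 + exp(-psi(y)_i * z_i)), with logistic binary loss."""
    return np.sum(np.log1p(np.exp(-psi(y, len(z)) * z)))

p = multinomial_logit(np.array([2.0, 1.0, -1.0]))    # z = Wx for illustrative W, x
losses = [ova_loss(j, psi(0, 3)) for j in range(3)]  # scores z = psi(0)
```

Here `losses[0]` is the smallest entry, which is the content of the inequality you are asked to prove.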
   (c) Ordinal regression. One method for ordinal regression is to define a loss function

           $\ell(y, z) = \sum_{i=1}^{k-1} \ell_{\text{bin}}(\psi(y)_i, z_i)$,

       where

           $\psi(y) = (\mathbf{1}(y > 1), \mathbf{1}(y > 2), \ldots, \mathbf{1}(y > k - 1)) \in \mathbf{R}^{k-1}$.

       Again, we will use logistic loss as our binary loss function $\ell_{\text{bin}}$. (See page 42 of the loss function slides for details.) Prove the following inequalities hold, and explain what they mean:

           $\ell(i, \psi(i)) \leq \ell(j, \psi(i)), \quad \forall i, j \in \mathcal{Y}$,
           $\ell(i + 1, \psi(i)) \leq \ell(i + 2, \psi(i)), \quad \forall i \in \mathcal{Y}$.

3. Grading by Matrix Completion. There are $m$ ORIE 4741 project groups with $n$ students, and each student is responsible for grading several projects. Each project has an underlying quality; some are good, some less good. Some students are fair graders, and report the project quality as their grade. Some are easy graders, and report a higher grade. Some are harsh graders, and report a lower grade. We'll collect the grades into a grade matrix $A \in \mathbf{R}^{m \times n}$: $A_{ij}$ will represent the grade that student $j$ would assign to project $i$.

   Of course, we cannot assign each student to grade every project. Instead, we make peer review assignments $\Omega = \{(i_1, j_1), \ldots\}$. Here, $(i, j) \in \Omega$ if student $j$ is assigned to grade project $i$. Let us suppose each project is graded by $p$ peers. Unfortunately, this means that some projects are assigned harsher graders than others. Our goal is to find a fair way to compute a project's final grade. We consider two methods:

   (a) Averaging. The grade $g_i$ for project $i$ is the average of the grades given by peer reviewers:

           $g_i^{\text{avg}} = \frac{1}{p} \sum_{j : (i, j) \in \Omega} A_{ij}$.

   (b) Matrix completion. We fit a low rank model to the grade matrix and use it to compute an estimate $\hat{A} \in \mathbf{R}^{m \times n}$ of the grade matrix. To be more concrete, let's suppose that we find $\hat{A}$ by fitting a rank-1 model. We will use Huber loss, for robustness against outlier grades, and nonnegative regularization, since both student grading toughness and project quality are nonnegative:

           minimize $\;\sum_{(i, j) \in \Omega} \text{huber}(A_{ij} - x_i y_j) + \mathbf{1}(x \geq 0) + \mathbf{1}(y \geq 0)$,
       where $x \in \mathbf{R}^m$ and $y \in \mathbf{R}^n$. We compute our estimate $\hat{A}$ as $\hat{A} = x y^T$. In other words, $\hat{A}$ is the rank-1 matrix that matches the observations best in the sense of Huber error. We compute the grade $g_i$ for project $i$ as the average of these estimated grades:

           $g_i^{\text{mc}} = \frac{1}{n} \sum_{j=1}^n \hat{A}_{ij}$.

       In this problem, we will consider which of these two grading schemes, averaging or matrix completion, is better. Code for this problem can be found at

           https://github.com/orie4741/homework/blob/master/grading_by_matrix_completion.ipynb

   (a) Analytical problem. Consider $m = 2$ project groups and $n = 4$ peer graders. Suppose group 1 did well on their project and deserves a grade of 6, whereas group 2 deserves a grade of 3. Graders 1 and 2 are easy graders, and graders 3 and 4 are harsh. Each project is graded by three graders. The grades given are

           $X = \begin{bmatrix} 8 & \star & 4 & 4 \\ 4 & 4 & \star & 2 \end{bmatrix}$.

       Here, a $\star$ in the $(i, j)$th entry means the $j$th student was not responsible for grading the $i$th project. Use both methods, averaging and matrix completion, to compute grades for the two groups. Here, you should be able to compute the results of both methods by hand (on paper). Explain how you computed $\hat{A}$. Compare your results. Which grading method would you say is more fair?

   (b) A more realistic example. Let's generate a more realistic example of a grade matrix and observation matrix. We use the following code to construct a rank-1 grade matrix with 40 rows and 120 columns, with true project quality scores ranging from 3 to 8 and student easiness indices (the ratio of the given score to the true score) ranging from 0.5 to 1.5. Each group is graded by 6 graders. Describe in words the structure of the true grades matrix generated by the code in the Jupyter notebook. What rank does it have?
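The construction described in (b) amounts to an outer product of a quality vector and an easiness vector. A minimal Python sketch of that structure (the actual notebook code may differ in details such as the random seed and distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 120                            # 40 projects, 120 student graders
quality = rng.uniform(3.0, 8.0, size=m)   # true project quality scores
easiness = rng.uniform(0.5, 1.5, size=n)  # ratio of given score to true score
A = np.outer(quality, easiness)           # A[i, j] = quality_i * easiness_j

# every row of A is a multiple of `easiness` and every column a multiple
# of `quality`, so the full grade matrix has rank 1
```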
   (c) Fit a low rank model. Using the LowRankModels package, fit a rank-1 model for this grade matrix using Huber loss and a nonnegative regularizer. Use your model to compute an estimated grade matrix $\hat{A}$.

   (d) Grade the projects. Compute final grades for all 40 projects using both the averaging and matrix completion methods. Compare the results. Which method would you say is more fair?

   (e) (Extra credit) Distributions. Try some other distributions for grades by changing the way we generate data, or changing how students are assigned to grade projects. Do the results change?

   (f) (Extra credit) Low rank models. Try some other matrix completion models: use different loss functions, regularizers, or ranks, initialize the models using different tricks, or change the parameters of the optimization algorithm used by LowRankModels. Which work better and which work less well? Why do you think that is?
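For parts (a) and (d), once an estimate $\hat{A}$ is in hand, both grading rules are just row averages: the average over the observed entries versus the average over the whole estimated row. A toy Python sketch (the numbers here are made up and the fit is assumed exact, so this does not reproduce the homework's answers):

```python
import numpy as np

def grade_avg(A, mask):
    """Averaging: mean of the observed grades {A[i, j] : (i, j) in Omega}."""
    return (A * mask).sum(axis=1) / mask.sum(axis=1)

def grade_mc(A_hat):
    """Matrix completion: mean over the full estimated row of A_hat."""
    return A_hat.mean(axis=1)

# toy rank-1 grade matrix: two easy graders (1.2x) and two harsh graders (0.8x)
A_hat = np.outer([5.0, 2.0], [1.2, 1.2, 0.8, 0.8])          # perfectly fit estimate
mask = np.array([[1, 0, 1, 1], [1, 1, 0, 1]], dtype=bool)   # Omega: one gap per row

g_avg = grade_avg(A_hat, mask)  # skewed by which graders happened to be assigned
g_mc = grade_mc(A_hat)          # recovers the true qualities 5 and 2 exactly here
```

Because averaging uses only the assigned grades, a project that draws more harsh graders is shortchanged; the matrix-completion grade averages over all $n$ estimated grades, so every project faces the same (estimated) panel.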