Machine Learning: Assignment 1

10-701 Machine Learning: Assignment 1
Due on February 20, 2014 at 12 noon
Barnabas Poczos, Aarti Singh

Instructions: Failure to follow these directions may result in loss of points. Your solutions for this assignment need to be in a pdf format and should be submitted to Blackboard and to a webpage (to be specified later) for peer review. For the programming question, your code should be well-documented, so a TA can understand what is happening. DO NOT include any identification (your name, Andrew ID, or email) in either the content or the filename of your submission.

MLE, MAP, Concentration (Pengtao)

1. MLE of Uniform Distributions [5 pts]

Given a set of i.i.d. samples X_1, ..., X_n ~ Uniform(0, θ), find the maximum likelihood estimator of θ.

(a) Write down the likelihood function. (3 pts)
(b) Find the maximum likelihood estimator. (2 pts)

2. Concentration [5 pts]

The instructors would like to know what percentage of the students like the Introduction to Machine Learning course. Let this unknown but hopefully very close to 1 quantity be denoted by μ. To estimate μ, the instructors created an anonymous survey which contains this question: "Do you like the Intro to ML course? Yes or No." Each student can only answer this question once, and we assume that the distribution of the answers is i.i.d.

(a) What is the MLE estimate of μ? (1 pt)
(b) Let the above estimator be denoted by μ̂. How many students should the instructors ask if they want the estimated value μ̂ to be so close to the unknown μ that P(|μ̂ − μ| > 0.1) < 0.05? (4 pts)
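For the uniform-distribution MLE question, a short numerical check with synthetic data can confirm a candidate answer: the log-likelihood is −n log θ for θ ≥ max_i X_i and −∞ otherwise, so it is maximized at the largest observed sample. The snippet below is only an illustrative check with made-up data, not the requested derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
X = rng.uniform(0.0, theta_true, size=50)  # i.i.d. samples from Uniform(0, theta)

# Evaluate the log-likelihood on a grid of candidate theta values:
# log L(theta) = -n * log(theta) if theta >= max(X), else -infinity.
grid = np.linspace(0.01, 5.0, 2000)
loglik = np.where(grid >= X.max(), -len(X) * np.log(grid), -np.inf)

print("argmax over the grid:", grid[np.argmax(loglik)])
print("max of the sample:   ", X.max())  # the two agree up to grid resolution
```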

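The concentration question does not name a specific inequality; assuming Hoeffding's inequality for i.i.d. variables bounded in [0, 1] is the intended tool, the required number of students can be sanity-checked numerically. The bound and the resulting value follow from that assumption, not from the assignment text:

```python
import math

# Hoeffding's inequality for the mean of n i.i.d. answers in [0, 1]:
#   P(|mu_hat - mu| > eps) <= 2 * exp(-2 * n * eps**2)
# Requiring the right-hand side to be below delta and solving for n:
eps, delta = 0.1, 0.05
n = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
print(n)  # 185 students suffice under this bound
```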
3. MAP of Multinomial Distribution [10 pts]

You have just got a loaded 6-sided die from your statistician friend. Unfortunately, he does not remember its exact probability distribution p_1, p_2, ..., p_6. He remembers, however, that he generated the vector (p_1, p_2, ..., p_6) from the following Dirichlet distribution:

    P(p_1, p_2, ..., p_6) = [Γ(Σ_{i=1}^6 u_i) / Π_{i=1}^6 Γ(u_i)] · Π_{i=1}^6 p_i^{u_i − 1} · δ(Σ_{i=1}^6 p_i − 1),

where he chose u_i = i for all i = 1, ..., 6. Here Γ denotes the gamma function, and δ is the Dirac delta. To estimate the probabilities p_1, p_2, ..., p_6, you roll the die 1000 times and then observe that side i occurred n_i times (Σ_{i=1}^6 n_i = 1000).

(a) Prove that the Dirichlet distribution is a conjugate prior for the multinomial distribution.
(b) What is the posterior distribution of the side probabilities, P(p_1, p_2, ..., p_6 | n_1, n_2, ..., n_6)?

Linear Regression (Dani)

1. Optimal MSE rule [10 pts]

Suppose we knew the joint distribution P_XY. The optimal rule f*: X → Y which minimizes the MSE (mean squared error) is given as

    f* = arg min_f E[(f(X) − Y)²].

Show that f*(X) = E[Y | X].

2. Ridge Regression [10 pts]

In class, we discussed ℓ2-penalized linear regression:

    β̂ = arg min_β Σ_{i=1}^n (Y_i − X_i β)² + λ‖β‖²,

where X_i = [X_i^(1) ... X_i^(p)].

(a) Show that a closed-form expression for the ridge estimator is β̂ = (AᵀA + λI)⁻¹ AᵀY, where A = [X_1; ...; X_n] and Y = [Y_1; ...; Y_n].
(b) An advantage of ridge regression is that a unique solution always exists, since (AᵀA + λI) is invertible. To be invertible, a matrix needs to be full rank. Argue that (AᵀA + λI) is full rank by characterizing its p eigenvalues in terms of the singular values of A and λ.

Logistic Regression (Prashant)

1. Overfitting and Regularized Logistic Regression [10 pts]

(a) Plot the sigmoid function 1/(1 + e^{−wX}) vs. X ∈ ℝ for increasing weights w ∈ {1, 5, 100}. A qualitative sketch is enough. Use these plots to argue why a solution with large weights can cause logistic regression to overfit.
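Returning to the ridge regression question: part (b)'s claim is easy to check numerically, since if A has singular values σ_1, ..., σ_p, then AᵀA + λI has eigenvalues σ_i² + λ > 0 and is therefore always invertible. The sketch below (with arbitrary synthetic data, purely illustrative) compares the closed-form estimator against that eigenvalue characterization:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 30, 5, 0.5
A = rng.normal(size=(n, p))                      # design matrix [X_1; ...; X_n]
Y = A @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Closed-form ridge estimator: beta = (A^T A + lambda I)^{-1} A^T Y
beta = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)

# Eigenvalues of A^T A + lambda I equal sigma_i^2 + lambda (sigma_i: singular values of A),
# so all of them are at least lambda > 0 and the matrix is full rank.
sigma = np.linalg.svd(A, compute_uv=False)
eigs = np.linalg.eigvalsh(A.T @ A + lam * np.eye(p))
print(np.allclose(np.sort(eigs), np.sort(sigma**2 + lam)))  # True
print(beta)
```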

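For part (a) of the logistic regression question just above, a qualitative sketch is all that is asked for, but the plots can also be generated directly. A minimal matplotlib sketch (matplotlib is an extra assumption here, not something the assignment requires) is:

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-3, 3, 500)
for w in (1, 5, 100):
    z = np.clip(w * X, -60, 60)               # clip to avoid overflow warnings in exp
    plt.plot(X, 1.0 / (1.0 + np.exp(-z)), label=f"w = {w}")

# As w grows, the sigmoid approaches a hard step at X = 0: tiny changes in X flip the
# prediction, which is the visual starting point for the overfitting argument in part (a).
plt.xlabel("X")
plt.ylabel("1 / (1 + exp(-wX))")
plt.legend()
plt.show()
```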
(b) To prevent overfitting, we want the weights to be small. To achieve this, instead of maximum conditional likelihood estimation (M(C)LE) for logistic regression,

    max_{w_0,...,w_d} Π_{i=1}^n P(Y_i | X_i, w_0, ..., w_d),

we can consider maximum conditional a posteriori (M(C)AP) estimation,

    max_{w_0,...,w_d} Π_{i=1}^n P(Y_i | X_i, w_0, ..., w_d) P(w_0, ..., w_d),

where P(w_0, ..., w_d) is a prior on the weights. Assuming a standard Gaussian prior N(0, I) for the weight vector, derive the gradient ascent update rules for the weights.

2. Multi-class Logistic Regression [10 pts]

One way to extend logistic regression to the multi-class setting (say K class labels) is to consider (K − 1) sets of weight vectors and define

    P(Y = y_k | X) ∝ exp(w_{k0} + Σ_{i=1}^d w_{ki} X_i) for k = 1, ..., K − 1.

(a) What model does this imply for P(Y = y_K | X)?
(b) What would be the classification rule in this case?
(c) Draw a set of training data with three labels and the decision boundary resulting from multi-class logistic regression. (The boundary does not need to be quantitatively correct, but it should qualitatively depict what a typical boundary from multi-class logistic regression looks like.)

Naive Bayes Classifier (Pulkit)

1. Naive Bayes Classification Implementation [25 pts]

In this question, you will write a Naive Bayes classifier and verify its performance on a newsgroup data set. As you learned in class, Naive Bayes is a simple classification algorithm that makes an assumption about the conditional independence of features, but it works quite well in practice. You will implement the Naive Bayes algorithm (multinomial model) to classify a news corpus into 20 different categories.

Handout - http://www.cs.cmu.edu/~aarti/class/10701_spring14/assignments/homework1.tar

You have been provided with the following data files in the handout:

train.data - Contains bag-of-words data for each training document. Each row of the file represents the number of occurrences of a particular term in some document. The format of each row is (docid, termid, Count).
train.label - Contains a label for each document in the training data.
test.data - Contains bag-of-words data for each testing document. The format of this file is the same as that of the train.data file.
test.label - Contains a label for each document in the testing data.

For this assignment, you need to write code to complete the following functions:

logprior(trainlabels) - Computes the log prior of the training data set. (5 pts)
loglikelihood(traindata, trainlabels) - Computes the log likelihood of the training data set. (7 pts)
naivebayesclassify(traindata, trainlabels, testdata) - Classifies the data using the Naive Bayes algorithm. (13 pts)

Implementation Notes

1. We compute log probabilities to prevent numerical underflow when multiplying many probabilities together. You may refer to this article on how to perform addition and multiplication in log space.
2. You may encounter words during classification that you have not seen during training, either for a particular class or overall. Your code should deal with that. Hint: Laplace smoothing.
3. Be memory efficient and please do not create a document-term matrix in your code. That would require upwards of 600 MB of memory.

(A minimal illustrative sketch of these notes appears at the end of the Naive Bayes section.)

Due to popular demand, we are allowing the solution to be coded in 3 languages: Octave, Julia, and Python.

Octave is an industry standard in numerical computing. Unlike MATLAB, it is an open-source language, and it has similar capabilities and syntax.
Julia is a popular new open-source language developed for numerical and scientific computing as well as being effective for general programming purposes. This is the first time this language is being supported in a CMU course.
Python is an extremely flexible language and is popular in industry and the data science community. Powerful Python libraries will not be available to you.

For Octave and Julia, a blank function interface has been provided for you (in the handout). However, for Python, you will need to perform the I/O for the data files and ensure the results are written to the correct output files.

Challenge Question

This question is not graded, but it is highly recommended that you try it. In the question above, we are using all the terms from the vocabulary to make a prediction, which leads to a lot of noisy features. Although it seems counter-intuitive, classifiers built from a smaller vocabulary often perform better because they generalize better to unseen data. Noisy features that are not well represented often skew the perceived distribution of words, leading to classification errors. Therefore, the classification can be improved by selecting a subset of highly effective words. Write a program to select a subset of the words from the vocabulary provided to you, and then use this subset to run your Naive Bayes classification again. Verify the changes in accuracy. TF-IDF and information theory are good places to start looking.
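The function interface above is the one to implement; purely as an illustration of implementation notes 1 and 2 (log-space arithmetic and Laplace smoothing), here is a minimal multinomial Naive Bayes sketch in Python. It uses small dense toy arrays and numpy for brevity, so it deliberately ignores the sparse (docid, termid, Count) file format and the memory constraints of the real data set:

```python
import numpy as np

def fit_multinomial_nb(counts, labels, alpha=1.0):
    """counts: (n_docs, n_terms) term-count matrix; labels: (n_docs,) class ids 0..K-1.
    Returns (log_prior, log_likelihood) with Laplace smoothing parameter alpha."""
    n_docs, n_terms = counts.shape
    classes = np.unique(labels)
    log_prior = np.array([np.log(np.mean(labels == c)) for c in classes])
    log_lik = np.zeros((len(classes), n_terms))
    for k, c in enumerate(classes):
        term_counts = counts[labels == c].sum(axis=0)   # counts of each term in class c
        smoothed = term_counts + alpha                  # Laplace smoothing: unseen terms get alpha
        log_lik[k] = np.log(smoothed / smoothed.sum())  # log P(term | class c)
    return log_prior, log_lik

def classify(counts, log_prior, log_lik):
    """Log-space scoring: log P(c) + sum_t count_t * log P(t | c); argmax over classes."""
    scores = counts @ log_lik.T + log_prior
    return np.argmax(scores, axis=1)

# Toy example: 4 training documents, 5 vocabulary terms, 2 classes.
train_counts = np.array([[3, 0, 1, 0, 0],
                         [2, 1, 0, 0, 0],
                         [0, 0, 0, 2, 3],
                         [0, 1, 0, 1, 2]])
train_labels = np.array([0, 0, 1, 1])
log_prior, log_lik = fit_multinomial_nb(train_counts, train_labels)
print(classify(np.array([[1, 0, 2, 0, 0], [0, 0, 0, 1, 4]]), log_prior, log_lik))  # [0 1]
```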

Support Vector Machines (Jit)

1. SVM Matching [15 points]

Figure 1 (at the end of this problem) plots SVM decision boundaries resulting from using different kernels and/or different slack penalties. In Figure 1, there are two classes of training data, with labels y_i ∈ {−1, 1}, represented by circles and squares respectively. The SOLID circles and squares represent the support vectors. Determine which plot in Figure 1 was generated by each of the following optimization problems. Explain your reasoning for each choice.

1. min_{w,b,ξ} (1/2) wᵀw + C Σ_{i=1}^n ξ_i
   s.t. for i = 1, ..., n: ξ_i ≥ 0, (w·x_i + b) y_i − (1 − ξ_i) ≥ 0, and C = 0.1.

2. min_{w,b,ξ} (1/2) wᵀw + C Σ_{i=1}^n ξ_i
   s.t. for i = 1, ..., n: ξ_i ≥ 0, (w·x_i + b) y_i − (1 − ξ_i) ≥ 0, and C = 1.

3. max_α Σ_{i=1}^n α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
   s.t. Σ_{i=1}^n α_i y_i = 0; α_i ≥ 0, i = 1, ..., n; where K(u, v) = u·v + (u·v)².

4. max_α Σ_{i=1}^n α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
   s.t. Σ_{i=1}^n α_i y_i = 0; α_i ≥ 0, i = 1, ..., n; where K(u, v) = exp(−‖u − v‖² / 2).

5. max_α Σ_{i=1}^n α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)
   s.t. Σ_{i=1}^n α_i y_i = 0; α_i ≥ 0, i = 1, ..., n; where K(u, v) = exp(−2‖u − v‖²).

Figure 1: Induced Decision Boundaries
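Although the matching question is meant to be answered by reasoning about the geometry, the effect of the slack penalty C and of the kernel choice can also be explored empirically. The sketch below uses scikit-learn (not required by the assignment, so it is illustrative only) with a custom quadratic kernel of the same form as problem 3, RBF kernels of two arbitrary widths, and invented toy data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Two roughly separable point clouds as stand-ins for the circles and squares.
X = np.vstack([rng.normal(loc=[-1, -1], scale=0.7, size=(40, 2)),
               rng.normal(loc=[1, 1], scale=0.7, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)

def quadratic_kernel(U, V):
    # K(u, v) = u.v + (u.v)^2, the kernel form in problem 3.
    G = U @ V.T
    return G + G ** 2

models = {
    "linear, C=0.1": SVC(kernel="linear", C=0.1),
    "linear, C=1":   SVC(kernel="linear", C=1.0),
    "quadratic":     SVC(kernel=quadratic_kernel, C=1.0),
    "rbf (wide)":    SVC(kernel="rbf", gamma=0.5, C=1.0),
    "rbf (narrow)":  SVC(kernel="rbf", gamma=2.0, C=1.0),
}
for name, clf in models.items():
    clf.fit(X, y)
    # A smaller C (or a narrower RBF kernel) typically leaves more support vectors.
    print(f"{name:15s} support vectors: {clf.n_support_.sum()}")
```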