C4B Machine Learning Answers II
A. Zisserman, Hilary Term 2011

1.(a) Show that for the logistic sigmoid function

    dσ(z)/dz = σ(z)(1 − σ(z))

Start from the definition of σ(z):

    σ(z) = 1 / (1 + e^(−z))

Note that

    1 − σ = e^(−z) / (1 + e^(−z))

Then

    dσ(z)/dz = e^(−z) / (1 + e^(−z))² = σ(1 − σ)

(b) If y_i ∈ {0, 1}, then the negative log-likelihood for logistic regression training is

    L(w) = − Σ_{i=1}^N [ y_i log σ(w⊤x_i) + (1 − y_i) log(1 − σ(w⊤x_i)) ]

Show that its gradient has the simple form

    dL/dw = − Σ_{i=1}^N (y_i − σ_i) x_i,   where σ_i = σ(w⊤x_i),

and hence derive the update equation for learning w using a steepest descent algorithm.
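Before turning to the answer for (b), the identity in part (a) can be checked numerically with a finite difference (a minimal numpy sketch; the grid of z values is arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-5, 5, 11)
    analytic = sigmoid(z) * (1 - sigmoid(z))               # sigma(z)(1 - sigma(z))
    h = 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference approximation
    print(np.max(np.abs(analytic - numeric)))              # tiny (~1e-10): the identity holds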

Reminder: write this more compactly as

    p(y = 1 | x; w) = σ(w⊤x)
    p(y = 0 | x; w) = 1 − σ(w⊤x)
    p(y | x; w) = σ(w⊤x)^y (1 − σ(w⊤x))^(1−y)

Then the likelihood (assuming independence) is

    p(y | x; w) = Π_{i=1}^N σ(w⊤x_i)^{y_i} (1 − σ(w⊤x_i))^{1−y_i}

and the negative log-likelihood is

    L(w) = − Σ_{i=1}^N [ y_i log σ(w⊤x_i) + (1 − y_i) log(1 − σ(w⊤x_i)) ]

We need to compute

    dL/dw = − Σ_{i=1}^N [ y_i d log σ(w⊤x_i)/dw + (1 − y_i) d log(1 − σ(w⊤x_i))/dw ]
          = − Σ_{i=1}^N [ (y_i/σ_i) dσ(w⊤x_i)/dw − ((1 − y_i)/(1 − σ_i)) dσ(w⊤x_i)/dw ]
          = − Σ_{i=1}^N [ (y_i/σ_i) σ_i(1 − σ_i) x_i − ((1 − y_i)/(1 − σ_i)) σ_i(1 − σ_i) x_i ]
          = − Σ_{i=1}^N [ y_i(1 − σ_i) x_i − (1 − y_i) σ_i x_i ]
          = − Σ_{i=1}^N (y_i − σ_i) x_i

To minimize a cost function C(w) with steepest descent, the iterative update is

    w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η is the learning rate. So in this case, for each data point x_i,

    w ← w + η (y_i − σ(w⊤x_i)) x_i
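This per-point update is easy to run directly. Below is a minimal numpy sketch of logistic regression trained with it; the synthetic data, learning rate and number of epochs are illustrative assumptions, not part of the question:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative synthetic data: two Gaussian blobs, labels y in {0, 1}
    N = 200
    X = np.vstack([rng.normal(-1, 1, (N // 2, 2)), rng.normal(+1, 1, (N // 2, 2))])
    y = np.hstack([np.zeros(N // 2), np.ones(N // 2)])
    X = np.hstack([X, np.ones((N, 1))])    # append 1 so the bias is absorbed into w

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.zeros(3)
    eta = 0.1                              # learning rate (assumed value)
    for epoch in range(50):
        for i in rng.permutation(N):       # cycle through the data
            w += eta * (y[i] - sigmoid(w @ X[i])) * X[i]   # w <- w + eta (y_i - sigma(w'x_i)) x_i

    print(np.mean((sigmoid(X @ w) > 0.5) == y))            # training accuracy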

2.(a) Show that if the SVM cost function is written as

    C(w) = (1/N) Σ_{i=1}^N [ λ/2 ‖w‖² + max(0, 1 − y_i f(x_i)) ]

where f(x_i) = w⊤x_i, then using steepest descent optimization, w_{t+1} may be learnt from w_t by cycling through the data with the following update rule:

    w_{t+1} ← (1 − ηλ) w_t + η y_i x_i    if y_i w⊤x_i < 1
    w_{t+1} ← (1 − ηλ) w_t                otherwise

where η is the learning rate.

First, start from the standard form of the SVM:

    min_w  ‖w‖² + C Σ_{i=1}^N max(0, 1 − y_i f(x_i))

Then write this as an average:

    min_w  C(w) = (λ/2) ‖w‖² + (1/N) Σ_{i=1}^N max(0, 1 − y_i f(x_i))
                = (1/N) Σ_{i=1}^N [ λ/2 ‖w‖² + max(0, 1 − y_i f(x_i)) ]

(with λ = 2/(NC), up to an overall scale of the problem).

Now compute the gradient w.r.t. w. For the hinge loss the sub-gradient is

    −y_i x_i    if y_i w⊤x_i < 1
    0           otherwise

and for the λ‖w‖²/2 term the gradient is λw. Putting this together with the iterative update rule

    w_{t+1} ← w_t − η_t ∇_w C(w_t)

gives the iterative update

    w_{t+1} ← (1 − ηλ) w_t + η y_i x_i    if y_i w⊤x_i < 1
    w_{t+1} ← (1 − ηλ) w_t                otherwise
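The same update can be sketched in a few lines of numpy (a Pegasos-style stochastic sub-gradient pass; the data, λ, η and epoch count are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative data: labels y in {-1, +1}, two well-separated blobs
    N = 200
    X = np.vstack([rng.normal(-2, 1, (N // 2, 2)), rng.normal(+2, 1, (N // 2, 2))])
    y = np.hstack([-np.ones(N // 2), np.ones(N // 2)])

    w = np.zeros(2)
    lam, eta = 0.01, 0.01                  # regularization and learning rate (assumed values)
    for epoch in range(100):
        for i in rng.permutation(N):
            if y[i] * (w @ X[i]) < 1:      # margin violated: shrink and add eta*y_i*x_i
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                          # margin satisfied: only the shrinkage term
                w = (1 - eta * lam) * w

    print(np.mean(np.sign(X @ w) == y))    # training accuracy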

(b) Contrast the SVM update rule with that of the perceptron

    w ← w − η sgn(w⊤x_i) x_i

What are the differences, and how do they influence the margin?

There are two main differences: (i) the condition for the SVM is on whether the data point violates the margin (y_i w⊤x_i < 1), whereas for the perceptron the condition is on whether the point is incorrectly classified (y_i w⊤x_i < 0); (ii) for the perceptron there is no regularization, and so no −ηλw_t term resulting from this. Note, for the SVM, the −ηλw_t term, which is added even if the point is outside the margin, can decrease ‖w‖. For the perceptron, nothing is added if the point is correctly classified.

(c) The perceptron learning rule can be derived as steepest descent optimization of a loss function. What is the loss function?

    max(0, −y_i f(x_i))
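For comparison with the SVM sketch above, a single perceptron step might look like this (a sketch; the {−1, +1} label convention and the learning rate are assumptions):

    import numpy as np

    def perceptron_step(w, x_i, y_i, eta=1.0):
        """One perceptron update: change w only if x_i is misclassified."""
        if y_i * (w @ x_i) < 0:            # incorrectly classified (cf. SVM: y_i w'x_i < 1)
            w = w + eta * y_i * x_i        # equivalently w - eta * sgn(w'x_i) * x_i
        return w                           # no (1 - eta*lambda) shrinkage: no regularization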

3. A K-class discriminant is obtained by training K linear classifiers of the form f_k(x) = w_k⊤x + b_k and assigning a point to class C_k if f_k(x) > f_j(x) for all j ≠ k.

(a) Write the equation of the hyperplane separating class j and k.

Points on the hyperplane satisfy

    w_j⊤x + b_j = w_k⊤x + b_k

Thus, the equation is

    (w_j − w_k)⊤x + (b_j − b_k) = 0

(b) If x_A and x_B are both in the decision region R_j (i.e. classified as class j), then show that any point on the line

    x = λx_A + (1 − λ)x_B,   where 0 ≤ λ ≤ 1,

is also classified as class j.

For points x,

    f_j(x) = w_j⊤(λx_A + (1 − λ)x_B) + b_j

and using the linearity of the classifier

    f_j(x) = λf_j(x_A) + (1 − λ)f_j(x_B)

As x_A and x_B are in region R_j, it follows that f_j(x_A) > f_k(x_A) and f_j(x_B) > f_k(x_B) for all k ≠ j. Since λ ≥ 0 and 1 − λ ≥ 0, combining these inequalities gives f_j(x) > λf_k(x_A) + (1 − λ)f_k(x_B) = f_k(x) for all k ≠ j, and the result follows.
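A small numerical illustration of part (b) (purely illustrative: the random classifiers and points are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    K, d = 4, 3
    W = rng.normal(size=(K, d))            # rows are the w_k
    b = rng.normal(size=K)

    def predict(x):
        return np.argmax(W @ x + b)        # class k with largest f_k(x) = w_k'x + b_k

    # find two points that land in the same region, then check points on the segment
    xA, xB = rng.normal(size=d), rng.normal(size=d)
    while predict(xA) != predict(xB):
        xA, xB = rng.normal(size=d), rng.normal(size=d)

    j = predict(xA)
    lams = np.linspace(0, 1, 21)
    print(all(predict(lam * xA + (1 - lam) * xB) == j for lam in lams))  # True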

4. A student uses the regression function

    f(x, w) = w_0 + w_1 φ_1(x) + w_2 φ_2(x) + ... + w_M φ_M(x) = w⊤Φ(x)

(where x is a scalar and f a scalar-valued function) for two possible data sources:

(a) a periodic source which oscillates with a known period p;
(b) a polynomial of second degree.

What are suitable basis functions for each of these sources? Can the student save time and design a single set of basis functions φ_i(x) that will allow him/her to model observations from either source?

(a) A periodic source which oscillates with a known period p. Suitable basis functions are

    φ_1(x) = cos(2πx/p)    φ_2(x) = sin(2πx/p)

Recall the trigonometric cos identity

    cos(A − B) = cos A cos B + sin A sin B

so that cos(2π(x − θ)/p) may be written as a linear combination

    cos(2π(x − θ)/p) = cos(2πθ/p) cos(2πx/p) + sin(2πθ/p) sin(2πx/p)

for any phase θ.

(b) A polynomial of second degree. Suitable basis functions are

    φ_1(x) = x    φ_2(x) = x²

If the student simply combines the two basis sets then, given sufficient data, the coefficients of the basis functions that are not relevant for that source should be close to zero.
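A sketch of the combined basis set fitted by least squares, under an assumed period p = 2 and synthetic data from the periodic source; the polynomial coefficients indeed come out near zero:

    import numpy as np

    p = 2.0                                    # known period (assumed value)

    def design_matrix(x):
        # combined basis: constant, periodic pair, and polynomial terms
        return np.column_stack([
            np.ones_like(x),                   # phi_0 = 1
            np.cos(2 * np.pi * x / p),         # phi_1
            np.sin(2 * np.pi * x / p),         # phi_2
            x,                                 # phi_3
            x ** 2,                            # phi_4
        ])

    # data from the periodic source (illustrative amplitude and phase)
    x = np.linspace(0, 10, 200)
    y = 1.5 * np.cos(2 * np.pi * (x - 0.3) / p)
    w, *_ = np.linalg.lstsq(design_matrix(x), y, rcond=None)
    print(np.round(w, 3))                      # weights on x and x^2 come out ~0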

5. The cost function for ridge regression is

    E(w) = (1/2) Σ_{i=1}^N ( y_i − w⊤Φ(x_i) )² + (λ/2) ‖w‖²

This has the dual representation

    Ẽ(a) = (1/2) ‖y − Ka‖² + (λ/2) a⊤Ka

where K is the N × N kernel Gram matrix with entries k(x_i, x_j) = Φ(x_i)⊤Φ(x_j). Show that the vector a that minimizes Ẽ(a) is given by

    a = (K + λI)⁻¹ y

Differentiate w.r.t. a:

    dẼ(a)/da = −K(y − Ka) + λKa = 0

and rearranging, assuming K is full rank,

    (K + λI)a = y

Hence

    a = (K + λI)⁻¹ y
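A minimal kernel ridge regression sketch using this dual solution (the Gaussian kernel, its width, and the toy data are illustrative assumptions):

    import numpy as np

    def gaussian_kernel(A, B, sigma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        return np.exp(-d2 / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (50, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

    lam = 0.1
    K = gaussian_kernel(X, X)
    a = np.linalg.solve(K + lam * np.eye(len(X)), y)   # a = (K + lambda I)^{-1} y

    X_test = np.linspace(-3, 3, 5)[:, None]
    y_pred = gaussian_kernel(X_test, X) @ a            # f(x) = sum_i a_i k(x, x_i)
    print(np.round(y_pred, 3))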

6. Consider the following 3-dimensional datapoints:

    (1.3, 1.6, 2.8), (4.3, −1.4, 5.8), (−0.6, 3.7, 0.7), (−0.4, 3.2, 5.8), (3.3, −0.4, 4.3), (−0.4, 3.1, 0.9)

The mean and covariance matrix of this data are

    c = (1.2500, 1.6333, 3.3833)

    S = [  3.7292  −3.7083   2.3825
          −3.7083   3.7022  −2.4294
           2.3825  −2.4294   4.3714 ]

and the eigenvector corresponding to the largest eigenvalue is

    u_1 = (0.5931, −0.5941, 0.5435)

(a) Verify that S u_1 = λ_1 u_1, where λ_1 = 9.6269.

(b) The sum of the eigenvalues is Σ_{i=1}^3 λ_i = 11.8028. What fraction of the variance is explained by the first principal component?

The variance is (the trace of the covariance matrix)

    (1/N) Σ_{i=1}^N ‖x_i − c‖² = Σ_{k=1}^d λ_k = 11.8028

The first principal component is x′_1 = u_1⊤(x − c), and its variance is

    (1/N) u_1⊤ [ Σ_{i=1}^N (x_i − c)(x_i − c)⊤ ] u_1 = u_1⊤ S u_1 = u_1⊤ λ_1 u_1 = λ_1 = 9.6269

Hence the proportion of variance is 9.6269 / 11.8028 = 0.8156 = 81.56%.

(c) The projection of a datapoint x onto the first principal component is given by y_1 = u_1⊤(x − c), and similarly y_2 = u_2⊤(x − c) for the second. If u_2 = (−0.3958, 0.3727, 0.8393), calculate the projection of the first datapoint (1.3, 1.6, 2.8) onto the first two principal components.

    (x′_1, x′_2) = (−0.2676, −0.5218)
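These numbers can be reproduced directly (a numpy sketch; note that the sign of each eigenvector returned by the solver is arbitrary, so the projections may come out negated):

    import numpy as np

    X = np.array([[ 1.3,  1.6, 2.8], [ 4.3, -1.4, 5.8], [-0.6, 3.7, 0.7],
                  [-0.4,  3.2, 5.8], [ 3.3, -0.4, 4.3], [-0.4, 3.1, 0.9]])
    c = X.mean(axis=0)                        # (1.25, 1.6333, 3.3833)
    S = (X - c).T @ (X - c) / len(X)          # covariance with 1/N normalization

    lams, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    u1, u2 = U[:, -1], U[:, -2]               # first two principal directions
    print(np.round(lams[-1], 4), np.round(lams.sum(), 4))    # 9.6269, 11.8028
    print(np.round(lams[-1] / lams.sum(), 4))                # 0.8156
    print(np.round([(X[0] - c) @ u1, (X[0] - c) @ u2], 4))   # +/-0.2676, +/-0.5218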

7. Given the following 2D data:

    x_1 = 3    x_2 = 3    x_3 = 3    x_4 = 3

determine the clusters obtained by running the K-means algorithm with K = 2 and the clusters initialized as

    (a) c_1 = x_1, c_2 = x_4
    (b) c_1 = x_1, c_2 = x_3
    (c) c_1 = x_1, c_2 = x_2
    (d) c_1 = (x_1 + x_4)/2, c_2 = (x_2 + x_3)/2

(a)–(d): [figures showing the resulting cluster assignments for each initialization]
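Each initialization can be checked mechanically with a plain K-means loop (a sketch; the data array below is an illustrative stand-in, since the question's exact coordinates do not survive in this transcription):

    import numpy as np

    def kmeans(X, centers, n_iter=20):
        """Plain K-means: alternate assignment and mean update from given initial centers."""
        c = np.array(centers, dtype=float)
        for _ in range(n_iter):
            d = ((X[:, None, :] - c[None, :, :]) ** 2).sum(-1)   # squared distances
            labels = d.argmin(axis=1)                            # assign to nearest center
            for k in range(len(c)):                              # recompute means
                if np.any(labels == k):
                    c[k] = X[labels == k].mean(axis=0)
        return labels, c

    # Illustrative stand-in data (not the values from the question)
    X = np.array([[-1.0, 3.0], [1.0, 3.0], [-1.0, -3.0], [1.0, -3.0]])
    print(kmeans(X, [X[0], X[3]])[0])   # initialization (a): c_1 = x_1, c_2 = x_4
    print(kmeans(X, [X[0], X[1]])[0])   # initialization (c): c_1 = x_1, c_2 = x_2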

8. Consider a GMM in which all the K mixture components have the same covariance matrix Σ = εI, where I is the identity matrix. Show that if this model is fitted using the EM algorithm, then in the limit that ε → 0 the algorithm is equivalent to K-means clustering. (Hint: compute the responsibilities, γ_ik, for this limit.)

If Σ = εI, then Σ⁻¹ = (1/ε) I and

    N(x | µ, Σ) ∝ e^(−‖x − µ‖² / (2ε))

In the Expectation step of the EM algorithm, the responsibilities are computed as

    γ_ik = π_k N(x_i | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_i | µ_j, Σ_j)
         = π_k e^(−‖x_i − µ_k‖² / (2ε)) / Σ_{j=1}^K π_j e^(−‖x_i − µ_j‖² / (2ε))

As ε → 0, the term for which ‖x_i − µ_k‖² is smallest will go to zero more slowly than the rest, and in the limit γ_ik → 1 for this k, and zero for all other k's (assuming π_k ≠ 0). This is a softmax. Thus, the responsibilities become the hard assignment variables r_ik of K-means. Similarly, in the Maximization step:

    µ_k = (1/N_k) Σ_{i=1}^N γ_ik x_i  →  (1/N_k) Σ_{i=1}^N r_ik x_i
    N_k = Σ_{i=1}^N γ_ik  →  Σ_{i=1}^N r_ik
    π_k = N_k / N
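The hardening of the responsibilities as ε shrinks is easy to see numerically (a sketch; the points, means and mixing weights are illustrative assumptions):

    import numpy as np

    def responsibilities(X, mus, pis, eps):
        """E-step responsibilities for a GMM with shared covariance eps*I (log-space for stability)."""
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)      # ||x_i - mu_k||^2
        log_r = np.log(pis) - d2 / (2 * eps)
        log_r -= log_r.max(axis=1, keepdims=True)                  # stabilize before exponentiating
        r = np.exp(log_r)
        return r / r.sum(axis=1, keepdims=True)

    X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
    mus = np.array([[0.5, 0.5], [5.0, 5.0]])
    pis = np.array([0.5, 0.5])
    for eps in [10.0, 1.0, 0.01]:
        print(eps, np.round(responsibilities(X, mus, pis, eps), 3))  # hardens to 0/1 as eps -> 0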

9. Describe what happens to an EM update if the mean of one of the Gaussian mixture components exactly coincides with one of the data points.

Consider 1D Gaussians:

    N(x | µ_k, σ_k) = (1 / (√(2π) σ_k)) e^(−(x − µ_k)² / (2σ_k²))

If x_i coincides with µ_k, then

    N(x_i | µ_k, σ_k) = 1 / (√(2π) σ_k)

Suppose σ_k is small; then γ_ik for this point will approach unity (from question 8) in the Expectation step, and the contributions of other points to this component will also be small. In the Maximization step

    σ_k² = (1/N_k) Σ_{i=1}^N γ_ik (x_i − µ_k)²

so σ_k can become smaller still. As the iterations proceed, σ_k → 0 and the (negative) log-likelihood diverges:

    L(θ) = − Σ_{i=1}^N ln Σ_{k=1}^K π_k N(x_i | µ_k, σ_k)  →  −∞
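A quick numerical illustration of this collapse (a sketch with assumed 1-D data; component 0 is placed exactly on the first data point and its σ is shrunk by hand):

    import numpy as np

    def gauss(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

    # Illustrative 1-D data; component 0 is centred exactly on the first data point
    x = np.array([0.0, 1.0, 2.0, 3.0])
    mus, pis = np.array([x[0], 2.0]), np.array([0.5, 0.5])

    for sigma0 in [1.0, 0.1, 0.001]:
        sigmas = np.array([sigma0, 1.0])
        lik = (pis * gauss(x[:, None], mus, sigmas)).sum(axis=1)   # mixture likelihood per point
        print(sigma0, -np.log(lik).sum())   # negative log-likelihood heads to -infinity as sigma0 -> 0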