SDMML HT 2016 - MSc Problem Sheet 4

1. The receiver operating characteristic (ROC) curve plots the sensitivity against the specificity of a binary classifier as the threshold for discrimination is varied. Let the data space be $\mathbb{R}$, and denote the class-conditional densities by $g_0(x)$ and $g_1(x)$ for $x \in \mathbb{R}$ and for the two classes 0 and 1. Consider a classifier that classifies $x$ as class 1 if $x \geq c$, where the threshold $c$ varies from $-\infty$ to $+\infty$.

(a) Give expressions for the population versions of the specificity and sensitivity of this classifier.

(b) Show that the AUC corresponds to the probability that $X_1 > X_0$, where data items $X_1$ and $X_0$ are independent and come from classes 1 and 0 respectively.
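
Not part of the sheet: a minimal R sketch illustrating part (b) on simulated data. It assumes Gaussian class-conditional densities with means 0 and 1 (an arbitrary choice), sweeps the threshold $c$ to trace the empirical ROC curve, and compares the trapezoidal AUC with the fraction of pairs with $X_1 > X_0$.

## Illustrative only: empirical ROC/AUC for a 1-d threshold classifier,
## assuming Gaussian class-conditionals (these densities are not from the sheet).
set.seed(1)
n0 <- 2000; n1 <- 2000
x0 <- rnorm(n0, mean = 0, sd = 1)    # class 0 scores
x1 <- rnorm(n1, mean = 1, sd = 1)    # class 1 scores
cs <- sort(c(-Inf, x0, x1, Inf))     # candidate thresholds
sens <- sapply(cs, function(thr) mean(x1 >= thr))   # sensitivity at threshold thr
spec <- sapply(cs, function(thr) mean(x0 <  thr))   # specificity at threshold thr
## trapezoidal AUC over the (1 - specificity, sensitivity) curve
fpr <- 1 - spec
ord <- order(fpr)
auc_trap <- sum(diff(fpr[ord]) * (head(sens[ord], -1) + tail(sens[ord], -1)) / 2)
## probability interpretation of part (b): P(X1 > X0) by pairwise comparison
auc_pairs <- mean(outer(x1, x0, ">"))
c(auc_trap = auc_trap, auc_pairs = auc_pairs)   # the two estimates should roughly agree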

2. 1-NN risk in binary classification. Let $\{(X_i, Y_i)\}_{i=1}^n$ be a training dataset where $X_i \in \mathbb{R}^p$ and $Y_i \in \{0, 1\}$. We denote by $g_k(x)$ the conditional density of $X$ given $Y = k$, assume that $g_k(x) > 0$ for all $x \in \mathbb{R}^p$, and write the class probabilities as $\pi_k = P(Y = k)$. We further denote $q(x) = P(Y = 1 \mid X = x)$.

(a) Consider the Bayes classifier minimizing risk w.r.t. the 0/1 loss $\mathbf{1}\{f(X) \neq Y\}$: $f_{\text{Bayes}}(x) = \arg\max_{k \in \{0,1\}} \pi_k g_k(x)$. Write the conditional expected loss $P[f(X) \neq Y \mid X = x]$ at a given test point $X = x$ in terms of $q(x)$. [The resulting expression should depend only on $q(x)$.]

(b) The 1-nearest neighbour (1-NN) classifier assigns to a test data point $x$ the label of the closest training point, i.e. $f_{\text{1NN}}(x) = y_{(1)}$ (class of the nearest neighbour in the training set). Given some test point $X = x$ and its nearest neighbour $X_{(1)} = x_{(1)}$, what is the conditional expected loss $P[f_{\text{1NN}}(X) \neq Y \mid X = x, X_{(1)} = x_{(1)}]$ of the 1-NN classifier in terms of $q(x)$ and $q(x_{(1)})$?

(c) As the number of training examples goes to infinity, i.e. $n \to \infty$, assume that the training data fills the space such that $q(x_{(1)}) \to q(x)$ for all $x$. Give the limit as $n \to \infty$ of $P[f_{\text{1NN}}(X) \neq Y \mid X = x]$. If we denote $R_{\text{Bayes}} = P[Y \neq f_{\text{Bayes}}(X)]$ and $R_{\text{1NN}} = P[Y \neq f_{\text{1NN}}(X)]$, show that for sufficiently large $n$,
$$R_{\text{Bayes}} \leq R_{\text{1NN}} \leq 2 R_{\text{Bayes}} (1 - R_{\text{Bayes}}).$$
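
Not part of the sheet: a small R simulation that makes the bound in (c) concrete. It assumes equal class priors and unit-variance Gaussian class-conditionals with means 0 and 2 (an arbitrary choice), fits a 1-NN classifier with class::knn, and compares its test error with the Bayes risk and with $2 R_{\text{Bayes}}(1 - R_{\text{Bayes}})$.

## Illustrative only: compare 1-NN error with the asymptotic bound on simulated data.
## Assumes equal priors and unit-variance Gaussian class-conditionals (means 0 and 2).
library(class)                        # provides knn()
set.seed(1)
n <- 5000
y  <- rbinom(n, 1, 0.5)
x  <- rnorm(n, mean = 2 * y)          # training data
yt <- rbinom(n, 1, 0.5)
xt <- rnorm(n, mean = 2 * yt)         # test data
pred <- knn(train = matrix(x), test = matrix(xt), cl = factor(y), k = 1)
err_1nn <- mean(as.integer(as.character(pred)) != yt)
## the Bayes classifier here thresholds at x = 1; its risk is pnorm(-1)
r_bayes <- pnorm(-1)
c(err_1nn = err_1nn, bayes = r_bayes, bound = 2 * r_bayes * (1 - r_bayes))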

3. Recall the definition of a one-hidden-layer neural network for binary classification from the lectures. The objective function is the $L_2$-regularized log loss
$$J = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right] + \lambda \left( \sum_{j,l} (w^h_{jl})^2 + \sum_l (w^o_l)^2 \right),$$
and the network is defined by
$$\hat{y}_i = s\Big(b^o + \sum_{l=1}^m w^o_l h_{il}\Big), \qquad h_{il} = s\Big(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\Big),$$
with transfer function $s(a) = \frac{1}{1 + e^{-a}}$.

(a) Verify that the derivatives needed for gradient descent are:
$$\frac{\partial J}{\partial w^o_l} = 2\lambda w^o_l + \sum_i (\hat{y}_i - y_i)\, h_{il},$$
$$\frac{\partial J}{\partial w^h_{jl}} = 2\lambda w^h_{jl} + \sum_i (\hat{y}_i - y_i)\, w^o_l\, h_{il} (1 - h_{il})\, x_{ij}.$$

(b) Suppose instead that you have a neural network for binary classification with $L$ hidden layers, each hidden layer having $m$ neurons with logistic transfer function. Give the parameterization for each layer, and derive the backpropagation algorithm to compute the derivatives of the objective with respect to the parameters. For simplicity, you can ignore the bias terms.
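
Not part of the sheet: a finite-difference sanity check for the derivative of $J$ with respect to $w^o_l$ in part (a), on a tiny randomly generated dataset (all sizes and the value of $\lambda$ below are arbitrary choices).

## Illustrative sketch: finite-difference check of dJ/dw_o[l] for the network in question 3.
set.seed(1)
n <- 20; p <- 4; m <- 3; lambda <- 0.1
X <- matrix(rnorm(n * p), n, p); y <- rbinom(n, 1, 0.5)
s <- function(a) 1 / (1 + exp(-a))
Wh <- matrix(rnorm(p * m, sd = 0.3), p, m); bh <- rnorm(m)
wo <- rnorm(m, sd = 0.3); bo <- rnorm(1)

objective <- function(wo) {
  H    <- s(sweep(X %*% Wh, 2, bh, "+"))            # n x m hidden activations
  yhat <- s(bo + H %*% wo)                          # n x 1 outputs
  -sum(y * log(yhat) + (1 - y) * log(1 - yhat)) +
    lambda * (sum(Wh^2) + sum(wo^2))
}

## analytic gradient from part (a)
H    <- s(sweep(X %*% Wh, 2, bh, "+"))
yhat <- as.vector(s(bo + H %*% wo))
grad_wo <- 2 * lambda * wo + colSums((yhat - y) * H)

## central finite-difference gradient
eps <- 1e-6
grad_fd <- sapply(seq_len(m), function(l) {
  e <- rep(0, m); e[l] <- eps
  (objective(wo + e) - objective(wo - e)) / (2 * eps)
})
max(abs(grad_wo - grad_fd))   # should be ~1e-8 or smaller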

4. In this question you will investigate fitting neural networks using the nnet library in R. We will train a neural network to classify handwritten digits 0-9. Download the files usps trainx.data, usps trainy.data, usps testx.data, usps testy.data from http://www.stats.ox.ac.uk/~sejdinov/sdmml/data/. Each handwritten digit is 16 x 16 in size, so that the data vectors are p = 256 dimensional and each entry (pixel) takes integer values 0-255. There are 1000 digits (100 digits of each class) in each of the training and test sets. You can view the digits with

image(matrix(as.matrix(trainx[500,]), 16, 16), col = grey(seq(0, 1, length = 256)))
trainy[500,]

Download the R script nnetusps.R from the course webpage. The script trains a 1-hidden-layer neural network with S = 10 hidden units for T = 10 iterations, reports the training and test errors, runs it for another 10 iterations, and reports the new training and test errors. To make the computations quicker, the script down-samples the training set to 100 cases, by using only one out of every 10 training cases. You will find the documentation for the nnet library useful: http://cran.r-project.org/web/packages/nnet/nnet.pdf.

(a) Edit the script to report the training and test error after every iteration of training the network. Use networks of size S = 10 and up to T = 100 iterations. Plot the training and test errors as functions of the number of iterations. Discuss the results and the figure.

(b) Edit the script to vary the size of the network, reporting the training and test errors for network sizes S = 1, 2, 3, 4, 5, 10, 20, 40. Use T = 25 iterations. Plot these as a function of the network size. Discuss the results and the figure.
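
One possible way to get per-iteration errors for part (a) is sketched below; it is my own sketch, not taken from the sheet. The course script nnetusps.R presumably structures this differently, and the variable names trainx, trainy, testx, testy for the (down-sampled) data are assumptions here. The idea is to refit nnet for one iteration at a time, warm-starting from the previous weights via the Wts argument.

## Illustrative sketch: per-iteration training/test error with nnet, by warm-starting.
## Assumes trainx/testx are data matrices and trainy/testy are digit labels 0-9.
library(nnet)
S <- 10; T <- 100
Ytr <- class.ind(factor(trainy))                 # 0/1 indicator matrix of classes
err <- function(fit, X, y) {
  pred <- max.col(predict(fit, X)) - 1           # predicted digit (columns are 0..9)
  mean(pred != y)
}
train_err <- test_err <- numeric(T)
fit <- nnet(trainx, Ytr, size = S, softmax = TRUE, maxit = 1,
            MaxNWts = 10000, trace = FALSE)
for (t in 1:T) {
  if (t > 1)
    fit <- nnet(trainx, Ytr, size = S, softmax = TRUE, maxit = 1,
                Wts = fit$wts, MaxNWts = 10000, trace = FALSE)
  train_err[t] <- err(fit, trainx, trainy)
  test_err[t]  <- err(fit, testx, testy)
}
matplot(1:T, cbind(train_err, test_err), type = "l", lty = 1,
        xlab = "iteration", ylab = "error", col = c("blue", "red"))
legend("topright", c("train", "test"), col = c("blue", "red"), lty = 1)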

5. Consider a binary classification problem with $\mathcal{Y} = \{1, 2\}$. We are at a node $t$ in a decision tree and would like to split it based on Gini impurity. Consider a categorical attribute $A$ with $L$ levels, i.e., $x_A \in \{a_1, a_2, \ldots, a_L\}$. For a generic example $(X, Y)$ reaching node $t$, denote:
$$p_k = P(Y = k), \quad k = 1, 2,$$
$$q_l = P(X_A = a_l), \quad l = 1, \ldots, L,$$
$$p_{k|l} = P(Y = k \mid X_A = a_l), \quad k = 1, 2, \; l = 1, \ldots, L.$$
Thus, the population Gini impurity is given by $2 p_1 p_2$. Further, assume $N = n$ examples $\{(X_i, Y_i)\}_{i=1}^n$ have reached node $t$, and denote
$$N_k = \#\{i : Y_i = k\}, \quad k = 1, 2,$$
$$N_l = \#\{i : (X_i)_A = a_l\}, \quad l = 1, \ldots, L,$$
$$N_{k,l} = \#\{i : Y_i = k \text{ and } (X_i)_A = a_l\}, \quad k = 1, 2, \; l = 1, \ldots, L.$$

(a) Assuming the data vectors reaching node $t$ are independent, explain why $(N_l)_{l=1}^L \mid N = n$, $N_k \mid N = n$ and $N_{k,l} \mid N_l = n_l$ have respectively multinomial, binomial and binomial distributions with parameters $(q_l)_{l=1}^L$, $p_k$ and $p_{k|l}$.

(b) If we split using attribute $A$ (and are not using dummy variables) we will have an $L$-way split, and the resulting impurity change will be
$$\Delta_{\text{Gini}} = 2 p_1 p_2 - \sum_{l=1}^L q_l \, 2\, p_{1|l}\, p_{2|l}.$$
The parameters $p_k$, $q_l$ and $p_{k|l}$ are unknown, however. The Gini impurity change estimate $\widehat{\Delta}_{\text{Gini}}$ is thus computed using the plug-in estimates $\hat{p}_k = N_k / N$, $\hat{q}_l = N_l / N$ and $\hat{p}_{k|l} = N_{k,l} / N_l$ respectively. Calculate the expected estimated impurity change $E[\widehat{\Delta}_{\text{Gini}} \mid N = n]$ between node $t$ and its $L$ child-nodes, conditioned on $N = n$ data vectors reaching node $t$.

(c) Suppose the attribute levels are actually uninformative about the class label, so that $p_{k|l} = p_k$. Show that, conditioned on $N = n$, the expected estimated Gini impurity change is then equal to $2 p_1 p_2 (L - 1)/n$.

(d) Is this attribute selection criterion biased in favour of attributes with more levels?
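
The claim in part (c) can be checked by simulation. The R sketch below is illustrative only and not part of the sheet; $p_1$, $L$, $q_l$ and $n$ are arbitrary choices. It simulates class labels and attribute levels independently, computes the plug-in impurity change, and compares its Monte Carlo mean with $2 p_1 p_2 (L-1)/n$.

## Illustrative Monte Carlo check of part (c): uninformative attribute, plug-in Gini change.
set.seed(1)
p1 <- 0.3; p2 <- 1 - p1
L  <- 4; q <- rep(1 / L, L); n <- 50
delta_hat <- replicate(20000, {
  y <- rbinom(n, 1, p1) + 1                    # class labels in {1, 2}
  a <- sample(L, n, replace = TRUE, prob = q)  # attribute levels, independent of y
  p1_hat <- mean(y == 1)
  child <- sapply(1:L, function(l) {
    idx <- which(a == l)
    if (length(idx) == 0) return(0)            # empty child contributes nothing
    p1l <- mean(y[idx] == 1)
    (length(idx) / n) * 2 * p1l * (1 - p1l)
  })
  2 * p1_hat * (1 - p1_hat) - sum(child)
})
c(monte_carlo = mean(delta_hat), theory = 2 * p1 * p2 * (L - 1) / n)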

6. Download the wine dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data and load it using read.table("wine.data", sep=","). A description of the dataset is given at https://archive.ics.uci.edu/ml/datasets/wine.

(a) Make a biplot using the scale=0 option, and then use the xlabs=as.numeric(td$Type) option in biplot to label the points by their $Type. The output should look like:

[Figure: biplot of the first two principal components (Comp.1 against Comp.2), with the observations labelled by wine type and the loading axes of the original variables V2-V14 overlaid.]

(b) Now train a classification tree using rpart, and relate the decision rule discovered there to the projections of the original variable axes displayed in the biplot. Give the plots of the tree as well as of the cross-validation results in the rpart object using plotcp.

(c) Now produce a random forest fit, calculating the out-of-bag estimation error, and compare with the tree analysis. You could start like:

library(randomForest)
rf <- randomForest(td[,2:14], td[,1], importance=TRUE)
print(rf)

Use tuneRF to find an optimal value of mtry, the number of attribute candidates at each split. Use varImpPlot to determine what the most important variables are.
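
A possible starting point for parts (a) and (b) is sketched below; it is my own sketch, not prescribed by the sheet. It assumes the wine data frame is called td and that its first column (the class) has been renamed Type, consistent with the td$Type used above; the use of princomp with cor=TRUE is also an assumption.

## Illustrative sketch for parts (a)-(b); assumes td is the loaded wine data frame
## with its first column renamed to Type.
td <- read.table("wine.data", sep = ",")
names(td)[1] <- "Type"

## (a) PCA biplot of the 13 attributes, points labelled by wine type
pc <- princomp(td[, 2:14], cor = TRUE)
biplot(pc, scale = 0, xlabs = as.numeric(td$Type))

## (b) classification tree and cross-validation results
library(rpart)
tree <- rpart(factor(Type) ~ ., data = td, method = "class")
plot(tree); text(tree, use.n = TRUE)
plotcp(tree)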

(Optional) 7. A mixture of experts is an ensemble model in which a number of experts compete to predict a label. Consider a regression problem with dataset $\{(x_i, y_i)\}_{i=1}^n$ and $y_i \in \mathbb{R}$. We have $E$ experts, each associated with a parametrized regression function $f_j(x; \theta_j)$, for $j = 1, \ldots, E$ (for example, each expert could be a neural network).

(a) A simple mixture of experts model uses as objective function
$$J\big(\pi, \sigma^2, \{\theta_j\}_{j=1}^E\big) = \sum_{i=1}^n \log \sum_{j=1}^E \pi_j \, e^{-\frac{1}{2\sigma^2} \left( f_j(x_i; \theta_j) - y_i \right)^2},$$
where $\pi = (\pi_1, \ldots, \pi_E)$ are mixing proportions and $\sigma^2$ is a parameter. Relate this objective function to the log-likelihood of a mixture model where each component is a conditional distribution of $Y$ given $X = x$.

(b) Differentiate the objective function with respect to $\theta_j$. Introduce a latent variable $z_i$, indicating which expert is responsible for predicting $y_i$, and interpret $\frac{\partial J}{\partial \theta_j}$ in the context of the corresponding EM algorithm. In this context, one needs to use the generalized EM algorithm, where in the M-step gradient descent is used to update the expert parameters $\theta_j$.

(c) A mixture of experts allows each expert to specialize in predicting the response in a certain part of the data space, with the overall model having better predictions than any single expert. However, to encourage this specialization, it is useful for the mixing proportions to also depend on the data vectors $x$, i.e. to model $\pi_j(x; \phi_j)$ as a function of $x$ with parameters $\phi_j$. The idea is that this gating network controls where each expert specializes. To ensure $\sum_{j=1}^E \pi_j(x; \phi) = 1$, we can use the softmax nonlinearity
$$\pi_j(x; \phi) = \frac{\exp(h_j(x; \phi_j))}{\sum_{l=1}^E \exp(h_l(x; \phi_l))},$$
where the $h_j(x; \phi_j)$ are parameterized functions for the gating network. The previous generalized EM algorithm extends easily to this scenario. Describe what changes have to be made, and derive a gradient descent learning update for $\phi_j$.
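
Not the derivation the question asks for, but an illustration of the generalized EM scheme it describes: a minimal R sketch assuming two linear experts $f_j(x; \theta_j) = \theta_j x$ on simulated 1-d data, with fixed uniform mixing proportions as in part (a). Each sweep does an E-step (responsibilities) followed by one gradient ascent step on each $\theta_j$; the data-generating process, step size and number of sweeps are arbitrary choices.

## Illustrative generalized-EM sketch for a mixture of two linear experts
## f_j(x; theta_j) = theta_j * x, with fixed uniform mixing proportions.
set.seed(1)
n <- 400
x <- runif(n, -1, 1)
y <- ifelse(x > 0, 2 * x, -3 * x) + rnorm(n, sd = 0.1)   # two local linear regimes
E <- 2; sigma2 <- 0.1^2
theta <- c(0.5, -0.5)                                    # initial expert slopes
pi_j  <- rep(1 / E, E)
eta   <- 1e-4                                            # gradient step size
for (sweep in 1:200) {
  ## E-step: responsibilities r[i, j] of expert j for point i (log-sum-exp for stability)
  logd <- sapply(1:E, function(j) log(pi_j[j]) - (theta[j] * x - y)^2 / (2 * sigma2))
  r <- exp(logd - apply(logd, 1, max))
  r <- r / rowSums(r)
  ## generalized M-step: one gradient ascent step on each theta_j, using the gradient
  ## of the expected complete-data log-likelihood
  grad <- sapply(1:E, function(j) sum(r[, j] * (y - theta[j] * x) * x) / sigma2)
  theta <- theta + eta * grad
}
theta   # should be close to the generating slopes 2 and -3 (in some order)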