NAME: Prof. Ruiz. 1. [5 points] What is the difference between simple random sampling and stratified random sampling?


CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014
Exam 1, November 24, 2014
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute

NAME: Prof. Ruiz

Problem I: (/10 points) Data Preprocessing
Problem II: (/15 points) Model Evaluation
Problem III: (/30 points) Decision Trees
Problem IV: (/45 points) Bayesian Models
TOTAL SCORE: (/100 points)

Instructions:
- Show your work and justify your answers
- Use the space provided to write your answers
- Ask in case of doubt

Problem I. Data Preprocessing [10 points]

1. [5 points] What is the difference between simple random sampling and stratified random sampling?

Solution: (Taken from the solutions to Exam 1 CS4445 B Term 2012) Simple random sampling draws data instances at random using a uniform distribution (that is, each data instance is equally likely to be chosen), while stratified random sampling draws data instances at random according to the distribution of the target attribute (so that the subsample preserves the distribution of the target attribute).

2. [5 points] Assume that A is a nominal attribute, other than the target attribute. Consider a missing value for this attribute A.

a. Briefly describe a possible unsupervised method to replace this missing value.

Solution: Replace the missing value with the mode of attribute A. [This is an unsupervised method because it doesn't use the target attribute at all.]

b. Briefly describe a possible supervised method to replace this missing value.

Solution: Replace the missing value with the mode of attribute A on the data instances that have the same classification (target value) as the instance that contains the missing value. [This is a supervised method because it uses the target attribute to modify A.]
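The contrast between the two sampling schemes in Problem I can be sketched in a few lines of Python. This is only an illustration: the toy data (70 "no" and 30 "yes" instances), the function names and the 10% sampling fraction are assumptions made for the example and do not come from the exam.

```python
import random
from collections import Counter, defaultdict

def simple_random_sample(instances, k, rng):
    # Uniform sampling: every instance is equally likely to be chosen.
    return rng.sample(instances, k)

def stratified_sample(instances, target_of, fraction, rng):
    # Group instances by target value, then draw the same fraction from
    # each group so the sample keeps the target distribution.
    groups = defaultdict(list)
    for inst in instances:
        groups[target_of(inst)].append(inst)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

# Hypothetical toy data: (id, target) pairs with a 70/30 class split.
rng = random.Random(0)
data = [(i, "no") for i in range(70)] + [(i, "yes") for i in range(70, 100)]

simple = simple_random_sample(data, 10, rng)
strat = stratified_sample(data, lambda inst: inst[1], 0.1, rng)
strat_counts = Counter(label for _, label in strat)
print(strat_counts)
```

The stratified sample always contains 7 "no" and 3 "yes" instances, while the class mix of the simple random sample varies from draw to draw.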

Problem II. Model Evaluation [15 points]

1. [10 points] Explain how n-fold cross validation works (to make it easier to explain, use n=10). How is the accuracy reported by this evaluation method computed?

Solution: (Taken from the solutions to Exam 1 CS4445 term 2003. Against my own suggestion above, I will explain the procedure for a general n rather than using n=10.)

Partition the input data into n folds (i.e., mutually disjoint and collectively exhaustive parts), approximately of the same size, at random using stratification. Let's denote those folds as F1, F2, ..., Fn. Now, perform the following process:

For i := 1 to n do
  - construct model Mi using as training data the union of all folds except for Fi. That is, the union of F1, ..., F(i-1), F(i+1), ..., Fn
  - test model Mi on fold Fi, and record the accuracy (or the error) obtained.
End For
Return the average of the accuracies (or of the errors) of all the models Mi.

2. [5 points] Briefly describe an advantage and a disadvantage of this evaluation method.

Solution: [Although performing n-fold cross validation has several advantages, we discuss just one of them here, as that is all that is required by the problem statement.]

Advantage: This systematic procedure allows each and every instance in the dataset to be part of the training set in some experiments (n-1 to be precise) and of the test set in other experiments (1 to be precise).

Disadvantage: The process might take a long time, as n models are constructed and tested.
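The loop in the solution to Problem II can be sketched as follows. This is a minimal illustration rather than a reference implementation: the folds are formed by a plain shuffle (the stratification step mentioned in the solution is omitted for brevity), and the majority-class learner, the toy data and all function names are assumptions.

```python
import random
from collections import Counter

def cross_validation_accuracy(instances, n, train, accuracy, rng):
    # Shuffle, then deal the instances into n folds of near-equal size.
    # Note: no stratification here, unlike the procedure in the text.
    shuffled = instances[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        # Train on every fold except fold i, then test on fold i.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(accuracy(model, folds[i]))
    # The reported accuracy is the average over the n models.
    return sum(scores) / n

def train_majority(training):
    # Hypothetical toy learner: always predict the most frequent class.
    return Counter(label for _, label in training).most_common(1)[0][0]

def accuracy(model, test):
    return sum(1 for _, label in test if label == model) / len(test)

data = [(i, "no") for i in range(70)] + [(i, "yes") for i in range(70, 100)]
avg = cross_validation_accuracy(data, 10, train_majority, accuracy, random.Random(0))
print(avg)
```

Each instance is used for testing exactly once and for training n-1 times, which is the advantage named in the solution; the n calls to `train` are the cost named as the disadvantage.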

Problem III. Decision Trees [30 points]

An alternative metric for selecting the best attribute to split a node in a decision tree is the Gini metric. Below are some facts about the Gini metric.

The formulas for the Entropy and for the Gini metrics are:

$\mathrm{Entropy}(t) = -\sum_{i=1}^{c} p(i \mid t) \log_2 p(i \mid t)$ and $\mathrm{Gini}(t) = 1 - \sum_{i=1}^{c} [p(i \mid t)]^2$

where c is the number of classes (i.e., values of the target attribute) and p(i | t) is the relative frequency of class i at node t.

As with Entropy, the Gini value of an attribute is the weighted sum of the Gini values of each of the attribute values.

As with Entropy, the attribute with the lowest Gini value is selected to split the tree node.

Consider the following dataset of 10 data instances. Assume that Defaulted Borrower is the target attribute.

Home Owner (H)   Marital Status (M)   Annual Income (A)   Defaulted Borrower (D)
yes              divorced             >85K                no
no               divorced             >85K                yes
no               married              >85K                no
yes              married              >85K                no
no               married              ≤85K                no
no               married              ≤85K                no
yes              single               >85K                no
no               single               ≤85K                no
no               single               ≤85K                yes
no               single               >85K                yes

The Gini values of the predicting attributes for this dataset are:
Gini value of Home Owner is 0.3428
Gini value of Marital Status is 0.3
Gini value of Annual Income is 0.4166

1. [10 points] Using the formula for Gini, show that the Gini value of Annual Income is indeed 0.4166. Show your work (please use the notation [# of no's, # of yeses] to neatly summarize the counts).

Solution: The [no, yes] counts for ≤85K are [3,1] and the [no, yes] counts for >85K are [4,2].

Gini(A) = Gini([3,1],[4,2])
= (4/10)*Gini([3,1]) + (6/10)*Gini([4,2])
= (4/10)*[1 - [(3/4)^2 + (1/4)^2]] + (6/10)*[1 - [(4/6)^2 + (2/6)^2]]
= (4/10)*[1 - [(9/16) + (1/16)]] + (6/10)*[1 - [(16/36) + (4/36)]]
= (4/10)*[1 - (10/16)] + (6/10)*[1 - (20/36)]
= (4/10)*(6/16) + (6/10)*(16/36)
= (3/20) + (4/15)
= 0.4166
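The arithmetic for the Gini value of Annual Income can be checked with a short Python sketch. The function names are assumptions; the class counts [3,1] and [4,2] are the ones given in the solution.

```python
def gini(counts):
    # Gini(t) = 1 - sum of squared relative class frequencies at node t.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(partitions):
    # Gini value of an attribute: the Gini of each attribute value,
    # weighted by the fraction of instances that take that value.
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Class counts for the two Annual Income values, as listed in the solution.
gini_income = weighted_gini([[3, 1], [4, 2]])
print(gini_income)
```

The result is 5/12, which the exam reports truncated as 0.4166.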

2. [20 points] Construct the full ID3 decision tree using Gini to rank the predicting attributes (Home Owner, Marital Status, Annual Income) with respect to the target/classification attribute (Defaulted Borrower). For the root node, you can assume that the Gini value of Home Owner is 0.3428, the Gini value of Marital Status is 0.3, and the Gini value of Annual Income is 0.4166 without calculating these values explicitly. For nodes other than the root, show all the steps of your Gini calculations. Make sure to show your work.

Solution: Since Marital Status has the lowest Gini value, it is chosen to split the root node.

For M=divorced (left-most child), the node has [no, yes] counts [1,1]. By simple inspection, Home Owner perfectly splits this node, while Annual Income doesn't split it. Hence, we select Home Owner to split this node.

For M=married (middle child), the node is homogeneous [4,0], so it is converted into a leaf.

For M=single (right-most child), the node is heterogeneous [2,2] and neither Home Owner nor Annual Income splits it perfectly. So we calculate the Gini value of these two attributes for this node:

Gini(H) = Gini([1,2],[1,0])
= (3/4)*Gini([1,2]) + (1/4)*Gini([1,0])
= (3/4)*[1 - [(1/3)^2 + (2/3)^2]] + 0
= (3/4)*[1 - (5/9)]
= (3/4)*[4/9]
= 1/3 = 0.33

Gini(A) = Gini([1,1],[1,1])
= (2/4)*Gini([1,1]) + (2/4)*Gini([1,1])
= [1 - [(1/2)^2 + (1/2)^2]]
= [1 - (1/2)]
= 1/2 = 0.5

Hence, Home Owner is chosen to split this node. The H=yes child node is homogeneous, so we make it into a leaf. The H=no child node is heterogeneous, so we split it with the only remaining attribute available in that subtree, namely A. One of the children of A is still heterogeneous [1,1], but since there are no more attributes available to split it, we convert it into a leaf and break the tie by choosing the first class value listed in the dataset, following Weka's convention.

The resulting tree, with [no, yes] counts at each node:

M [7,3]
  M=divorced: H [1,1]
    H=no: [0,1]
    H=yes: [1,0]
  M=married: [4,0]
  M=single: H [2,2]
    H=no: A [1,2]
      A≤85K: [1,1]
      A>85K: [0,1]
    H=yes: [1,0]
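The attribute choice at the M=single node can be reproduced with the sketch below, which scores each candidate split by its weighted Gini value and picks the lowest. The function and variable names are assumptions; the child counts [1,2],[1,0] and [1,1],[1,1] are the ones used in the solution.

```python
def gini(counts):
    # Gini of a single node from its class counts.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def split_gini(children):
    # Weighted Gini of a candidate split, given the class counts of each child.
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# Child class counts at the M=single node, taken from the solution.
candidates = {
    "Home Owner": [[1, 2], [1, 0]],
    "Annual Income": [[1, 1], [1, 1]],
}
scores = {name: split_gini(children) for name, children in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Home Owner scores 1/3 against 1/2 for Annual Income, so it is selected, matching the hand calculation.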

Problem IV. Bayesian Models [45 points]

Consider the following dataset, where Defaulted Borrower is the target attribute:

Home Owner (H)   Marital Status (M)   Annual Income (A)   Defaulted Borrower (D)
yes              divorced             >85K                no
no               married              >85K                no
yes              married              >85K                no
no               married              ≤85K                no
no               married              ≤85K                no
yes              single               >85K                no
no               single               ≤85K                no
no               divorced             >85K                yes
no               single               ≤85K                yes
no               single               >85K                yes

1. Naïve Bayes:

a. [5 points] Display the topology of the naïve Bayes graph for the training dataset. [10 points] Compute all of the Conditional Probability Tables (CPTs) in the graph. Show your work neatly.

Solution: In the naïve Bayes graph, the target D is the only parent of each of H, M and A (edges D → H, D → M, D → A). Each CPT entry adds 1 to the count in the numerator and the number of values of the attribute to the denominator.

P(D):
           D=no       D=yes
           (7+1)/12   (3+1)/12

P(H | D):
           H=no       H=yes
  D=no     (4+1)/9    (3+1)/9
  D=yes    (3+1)/5    (0+1)/5

P(M | D):
           divorced   married    single
  D=no     (1+1)/10   (4+1)/10   (2+1)/10
  D=yes    (1+1)/6    (0+1)/6    (2+1)/6

P(A | D):
           ≤85K       >85K
  D=no     (3+1)/9    (4+1)/9
  D=yes    (1+1)/5    (2+1)/5

b. [15 points] Determine the Defaulted Borrower value that this naïve Bayes model predicts for the test data instance: Home Owner = yes, Marital Status = single and Annual Income ≤85K (let's abbreviate this as: H=yes, M=single and A≤85K). Show your work in detail.

Solution: The prediction of the naïve Bayes model for this data instance is:

argmax_v P(D=v | H=yes & M=single & A≤85K)
= argmax_v P(H=yes & M=single & A≤85K | D=v) P(D=v)
= argmax_v P(H=yes | D=v) P(M=single | D=v) P(A≤85K | D=v) P(D=v)   because of the naïve assumption

For D=no: (4/9) (3/10) (4/9) (8/12) = 16/405 = 0.0395
For D=yes: (1/5) (3/6) (2/5) (4/12) = 1/75 = 0.013

Since D=no gets the highest probability, the naïve Bayes model predicts no.
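The two naïve Bayes scores can be recomputed exactly with Python's `fractions` module. This is a sketch under stated assumptions: the helper name `smoothed` is made up, the class labels "no" and "yes" follow the count order used in the CPTs, and the raw counts are read off the numerators of the CPT entries (for example, (3+1)/9 gives a count of 3 in a class of 7 instances for a two-valued attribute).

```python
from fractions import Fraction

def smoothed(count, total, num_values):
    # Add-one estimate, matching CPT entries of the form (count+1)/(total+num_values).
    return Fraction(count + 1, total + num_values)

class_counts = {"no": 7, "yes": 3}

# For each class: (count of the test instance's value, number of attribute values)
# for H=yes, M=single and A<=85K, in that order.
evidence = {
    "no": [(3, 2), (2, 3), (3, 2)],
    "yes": [(0, 2), (2, 3), (1, 2)],
}

scores = {}
for cls, total in class_counts.items():
    score = smoothed(total, 10, 2)  # prior, e.g. (7+1)/12
    for count, num_values in evidence[cls]:
        score *= smoothed(count, total, num_values)
    scores[cls] = score

prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores come out as 16/405 and 1/75, the same values as in the worked solution, so the larger one decides the prediction.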

2. Consider the following Bayesian net for the above dataset:

[Figure: Bayesian net over the nodes D, M, A and H. As the factorization below shows, D is the parent of M, and M and A are the parents of H.]

We want to determine the Defaulted Borrower value that this Bayesian net predicts for the test data instance: H=yes, M=single and A≤85K. One can prove (but you don't need to do so) that the prediction of this Bayesian net will be the following:

Predicted value of D
= argmax_v P(D=v | H=yes & M=single & A≤85K)
= argmax_v P(H=yes & M=single & A≤85K | D=v) P(D=v)
= argmax_v P(H=yes | M=single & A≤85K) P(M=single | D=v) P(A≤85K) P(D=v)

a. [5 points] Assume that all the probability values above are different from 0. Simplify the last line of the derivation above as much as you can, eliminating probability expressions that don't need to be considered. Explain your answer.

Solution: Since P(H=yes | M=single & A≤85K) and P(A≤85K) don't involve D=v, they won't affect the result of the argmax. In other words, they are constant with respect to v. Hence, they can be eliminated from the last line of the derivation above without affecting the result:

= argmax_v P(M=single | D=v) P(D=v)

b. [10 points] Using your simplified formula, determine the Defaulted Borrower value that this Bayesian net will predict for this test data. Calculate explicitly only the entries of the Conditional Probability Tables (CPTs) that you need in order to answer this question. Show your work.

Solution: argmax_v P(M=single | D=v) P(D=v)

For D=no: (3/10) (8/12) = 1/5
For D=yes: (3/6) (4/12) = 1/6

[Note that the CPT tables for D and for M on this Bayesian net are identical to the ones calculated for the naïve Bayes model.]

Since D=no gets the highest probability, this Bayesian net model predicts no.
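The simplification in part (a) relies on the fact that multiplying every candidate's score by the same positive constant does not change the argmax. The sketch below checks this on the numbers from part (b). The class labels "no" and "yes" and the value chosen for the constant are assumptions; the constant is an arbitrary stand-in for the dropped factors, not a value computed from the dataset.

```python
from fractions import Fraction

# P(M=single | D=v) and P(D=v) for each class value, as used in part (b).
factors = {
    "no": (Fraction(3, 10), Fraction(8, 12)),
    "yes": (Fraction(3, 6), Fraction(4, 12)),
}
scores = {cls: m * prior for cls, (m, prior) in factors.items()}

# Arbitrary positive constant standing in for the factors that do not depend on v.
constant = Fraction(1, 7)
scaled = {cls: constant * s for cls, s in scores.items()}

prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Both `scores` and `scaled` rank the classes in the same order, which is why the constant factors can be left out of the comparison.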