Solution of Final Exam: 10-701/15-781 Machine Learning


Solution of Final Exam: 10-701/15-781 Machine Learning
Fall 2004, Dec. 12th 2004

Your Andrew ID in capital letters:
Your full name:

There are 9 questions. Some of them are easy and some are more difficult, so if you get stuck on any one of the questions, proceed with the rest of the questions and return to it at the end if you have time remaining. The maximum score of the exam is 100 points. If you need more room to work out your answer to a question, use the back of the page and clearly mark on the front of the page if we are to look at what is on the back. You should attempt to answer all of the questions. You may use any and all notes, as well as the class textbook. You have 3 hours. Good luck!

Problem 1. Assorted Questions (16 points)

(a) [3.5 pts] Suppose we have a sample of n real values, called x_1, x_2, ..., x_n, each sampled from a p.d.f. p(x) which has the following form:

f(x) = α e^(-αx), if x ≥ 0
f(x) = 0, otherwise     (1)

where α is an unknown parameter. Which one of the following expressions is the maximum likelihood estimate of α? (Assume that in our sample, all x_i are larger than 1.) The sixteen answer choices are fractions built by combining n with Σ_{i=1}^n f(x_i) or max_i f(x_i), where f(x_i) is one of log(x_i), x_i, x_i^2, or e^(x_i).

Answer: Choose [7]. The log-likelihood is n·log(α) − α·Σ_{i=1}^n x_i; setting its derivative to zero gives the MLE α̂ = n / Σ_{i=1}^n x_i.
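For a quick numerical check of the closed-form estimate, the short sketch below (a minimal NumPy example; the sample values are made up, since the exam gives none) maximizes the exponential log-likelihood on a grid and compares the maximizer with n / Σ x_i.

```python
import numpy as np

# Hypothetical sample (the exam does not provide concrete values).
x = np.array([1.2, 3.5, 2.1, 4.8, 1.9])
n = len(x)

# Closed-form MLE for the exponential rate: alpha_hat = n / sum(x_i).
alpha_closed = n / x.sum()

# Numerical check: evaluate log L(alpha) = n*log(alpha) - alpha*sum(x_i)
# on a grid of alpha values and pick the maximizer.
grid = np.linspace(0.01, 2.0, 20000)
loglik = n * np.log(grid) - grid * x.sum()
alpha_grid = grid[np.argmax(loglik)]

print(f"closed form: {alpha_closed:.4f}, grid search: {alpha_grid:.4f}")
```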

(b) [7.5 pts] Suppose that X_1, ..., X_m are categorical input attributes and Y is a categorical output attribute. Suppose we plan to learn a decision tree without pruning, using the standard algorithm.

b.1 (True or False - 1.5 pts): If X_i and Y are independent in the distribution that generated this dataset, then X_i will not appear in the decision tree.
Answer: False (because the attribute may become relevant further down the tree when the records are restricted to some value of another attribute), e.g. XOR.

b.2 (True or False - 1.5 pts): If IG(Y | X_i) = 0 according to the values of entropy and conditional entropy computed from the data, then X_i will not appear in the decision tree.
Answer: False, for the same reason.

b.3 (True or False - 1.5 pts): The maximum depth of the decision tree must be less than m + 1.
Answer: True, because the attributes are categorical and can each be split only once.

b.4 (True or False - 1.5 pts): Suppose the data has R records; the maximum depth of the decision tree must be less than 1 + log2(R).
Answer: False, because the tree may be unbalanced.

b.5 (True or False - 1.5 pts): Suppose one of the attributes has R distinct values, and it has a unique value in each record. Then the decision tree will certainly have depth 0 or 1 (i.e. will be a single node, or else a root node directly connected to a set of leaves).
Answer: True, because that attribute will have perfect information gain. If an attribute has perfect information gain it must split the records into pure buckets, which can be split no more.
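The XOR counter-example behind b.1 and b.2 can be made concrete. In the small NumPy sketch below (an illustration, not part of the exam), both attributes have zero information gain at the root, yet after splitting on X1 the attribute X2 has full information gain in each child, so it still appears in the tree.

```python
import numpy as np

def entropy(y):
    """Entropy (in bits) of a Boolean label vector."""
    p = np.mean(y)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def info_gain(y, x):
    """IG(Y | X) for a Boolean split attribute x."""
    h = entropy(y)
    for v in (0, 1):
        mask = (x == v)
        h -= mask.mean() * entropy(y[mask])
    return h

# XOR data: Y = X1 xor X2, so each attribute alone is independent of Y.
X1 = np.array([0, 0, 1, 1])
X2 = np.array([0, 1, 0, 1])
Y = X1 ^ X2

print("IG at root:", info_gain(Y, X1), info_gain(Y, X2))          # both 0.0
for v in (0, 1):  # after splitting on X1, X2 becomes fully informative
    m = (X1 == v)
    print(f"IG of X2 in branch X1={v}:", info_gain(Y[m], X2[m]))  # both 1.0
```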

(c) [5 pts] Suppose you have this data set with one real-valued input and one real-valued output:

x  y
0  2
2  2
3  1

(c.1) What is the mean squared leave-one-out cross-validation error of using linear regression? (i.e. the model is y = β_0 + β_1·x + noise)
Answer: (2^2 + (2/3)^2 + 1^2) / 3 = 49/27

(c.2) Suppose we use a trivial algorithm of predicting a constant y = c. What is the mean squared leave-one-out error in this case? (Assume c is learned from the non-left-out data points.)
Answer: (0.5^2 + 0.5^2 + 1^2) / 3 = 1/2
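Both leave-one-out computations can be reproduced directly. The sketch below (plain NumPy, mirroring the three data points in the question) refits each model with one point held out and averages the squared errors.

```python
import numpy as np

x = np.array([0.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 1.0])

def loo_mse(predict_fn):
    """Mean squared leave-one-out error for a given fitting rule."""
    errs = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        pred = predict_fn(x[keep], y[keep], x[i])
        errs.append((y[i] - pred) ** 2)
    return np.mean(errs)

def linear(xs, ys, x0):
    b1, b0 = np.polyfit(xs, ys, deg=1)   # fit y = b0 + b1*x to the kept points
    return b0 + b1 * x0

def constant(xs, ys, x0):
    return ys.mean()                     # y = c, learned from the kept points

print(loo_mse(linear), 49 / 27)    # both ~1.8148
print(loo_mse(constant), 1 / 2)    # both 0.5
```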

Problem 2. Bayes Rule and Bayes Classifiers (12 points)

Suppose you are given the following set of data with three Boolean input variables a, b, and c, and a single Boolean output variable K.

a  b  c  K
1  0  1  1
1  1  1  1
0  1  1  0
1  1  0  0
1  0  1  0
0  0  0  1
0  0  0  1
0  0  1  0

For parts (a) and (b), assume we are using a naive Bayes classifier to predict the value of K from the values of the other variables.

(a) [1.5 pts] According to the naive Bayes classifier, what is P(K=1 | a=1 ∧ b=1 ∧ c=0)?
Answer: 1/2.
P(K=1 | a=1 ∧ b=1 ∧ c=0) = P(K=1 ∧ a=1 ∧ b=1 ∧ c=0) / P(a=1 ∧ b=1 ∧ c=0)
= P(K=1) P(a=1 | K=1) P(b=1 | K=1) P(c=0 | K=1) / [P(a=1 ∧ b=1 ∧ c=0 ∧ K=1) + P(a=1 ∧ b=1 ∧ c=0 ∧ K=0)].

(b) [1.5 pts] According to the naive Bayes classifier, what is P(K=0 | a=1 ∧ b=1)?
Answer: 2/3.
P(K=0 | a=1 ∧ b=1) = P(K=0 ∧ a=1 ∧ b=1) / P(a=1 ∧ b=1)
= P(K=0) P(a=1 | K=0) P(b=1 | K=0) / [P(a=1 ∧ b=1 ∧ K=1) + P(a=1 ∧ b=1 ∧ K=0)].
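The naive Bayes posteriors in (a) and (b) can be checked mechanically from the eight records. The sketch below (a small NumPy illustration of the same counting) estimates each factor by maximum likelihood and renormalizes over the two values of K.

```python
import numpy as np

# Columns: a, b, c, K -- the eight records from the table above.
data = np.array([
    [1, 0, 1, 1], [1, 1, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0],
    [1, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 1, 0],
])
K = data[:, 3]

def nb_posterior(query):
    """P(K=1 | query) under naive Bayes; query maps column index -> value."""
    scores = {}
    for k in (0, 1):
        rows = data[K == k]
        score = np.mean(K == k)                      # prior P(K=k)
        for col, val in query.items():
            score *= np.mean(rows[:, col] == val)    # P(X_col = val | K=k)
        scores[k] = score
    return scores[1] / (scores[0] + scores[1])

print(nb_posterior({0: 1, 1: 1, 2: 0}))   # (a): 0.5
print(1 - nb_posterior({0: 1, 1: 1}))     # (b): P(K=0 | a=1, b=1) = 2/3
```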

Now, suppose we are using a joint Bayes classifier to predict the value of K from the values of the other variables.

(c) [1.5 pts] According to the joint Bayes classifier, what is P(K=1 | a=1 ∧ b=1 ∧ c=0)?
Answer: 0. Let num(X) be the number of records in our data matching X. Then we have P(K=1 | a=1 ∧ b=1 ∧ c=0) = num(K=1 ∧ a=1 ∧ b=1 ∧ c=0) / num(a=1 ∧ b=1 ∧ c=0) = 0/1 = 0.

(d) [1.5 pts] According to the joint Bayes classifier, what is P(K=0 | a=1 ∧ b=1)?
Answer: 1/2. P(K=0 | a=1 ∧ b=1) = num(K=0 ∧ a=1 ∧ b=1) / num(a=1 ∧ b=1) = 1/2.

In an unrelated example, imagine we have three variables X, Y, and Z.

(e) [2 pts] Imagine I tell you the following:
P(Z | X) = 0.7
P(Z | Y) = 0.4
Do you have enough information to compute P(Z | X ∧ Y)? If not, write "not enough info". If so, compute the value of P(Z | X ∧ Y) from the above information.
Answer: Not enough info.

(f) [2 pts] Instead, imagine I tell you the following:
P(Z | X) = 0.7
P(Z | Y) = 0.4
P(X) = 0.3
P(Y) = 0.5
Do you now have enough information to compute P(Z | X ∧ Y)? If not, write "not enough info". If so, compute the value of P(Z | X ∧ Y) from the above information.
Answer: Not enough info.

(g) [2 pts] Instead, imagine I tell you the following (falsifying my earlier statements):
P(Z ∧ X) = 0.2
P(X) = 0.3
P(Y) = 1
Do you now have enough information to compute P(Z | X ∧ Y)? If not, write "not enough info". If so, compute the value of P(Z | X ∧ Y) from the above information.
Answer: 2/3. P(Z | X ∧ Y) = P(Z | X) since P(Y) = 1. In this case, P(Z | X ∧ Y) = P(Z ∧ X)/P(X) = 0.2/0.3 = 2/3.

Problem 3. SVM (9 points)

(a) (True/False - 1 pt) Support vector machines, like logistic regression models, give a probability distribution over the possible labels given an input example.
Answer: False

(b) (True/False - 1 pt) We would expect the support vectors to remain the same in general as we move from a linear kernel to higher order polynomial kernels.
Answer: False (There are no guarantees that the support vectors remain the same. The feature vectors corresponding to polynomial kernels are non-linear functions of the original input vectors and thus the support points for maximum margin separation in the feature space can be quite different.)

(c) (True/False - 1 pt) The maximum margin decision boundaries that support vector machines construct have the lowest generalization error among all linear classifiers.
Answer: False (The maximum margin hyperplane is often a reasonable choice but it is by no means optimal in all cases.)

(d) (True/False - 1 pt) Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel of degree less than or equal to three.
Answer: True (A polynomial kernel of degree two suffices to represent any quadratic decision boundary such as the one from the generative model in question.)

(e) (True/False - 1 pt) The values of the margins obtained by two different kernels K_1(x, x_0) and K_2(x, x_0) on the same training set do not tell us which classifier will perform better on the test set.
Answer: True (We need to normalize the margin for it to be meaningful. For example, a simple scaling of the feature vectors would lead to a larger margin. Such a scaling does not change the decision boundary, however, and so the larger margin cannot directly inform us about generalization.)

(f) (2 pts) What is the leave-one-out cross-validation error estimate for maximum margin separation in the following figure? (We are asking for a number.)
Answer: 0. Based on the figure we can see that removing any single point would not change the resulting maximum margin separator. Since all the points are initially classified correctly, the leave-one-out error is zero.

(g) (2 pts) Now let us discuss an SVM classifier using a second order polynomial kernel. The first polynomial kernel maps each input data point x to Φ_1(x) = [x, x^2]^T. The second polynomial kernel maps each input data point x to Φ_2(x) = [2x, 2x^2]^T. In general, is the margin we would attain using Φ_2(x)

A. ( ) greater
B. ( ) equal
C. ( ) smaller
D. ( ) any of the above

in comparison to the margin resulting from using Φ_1(x)?
Answer: A.
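To see why scaling the feature map scales the margin, here is a hedged sketch using scikit-learn with a very large C to approximate a hard-margin SVM on made-up, separable 1-D data; the geometric margin 2/||w|| should come out roughly twice as large with Φ_2 = 2·Φ_1.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 1-D data, separable in the [x, x^2] feature space.
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([-1, -1, -1, 1, 1, 1])

phi1 = np.c_[x, x ** 2]        # Phi_1(x) = [x, x^2]
phi2 = 2.0 * phi1              # Phi_2(x) = [2x, 2x^2]

def margin(features):
    clf = SVC(kernel="linear", C=1e8).fit(features, y)   # ~hard margin
    return 2.0 / np.linalg.norm(clf.coef_)               # geometric margin

print(margin(phi1), margin(phi2))   # the second is ~twice the first
```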

Problem 4. Instance based learning (8 points)

The following picture shows a dataset with one real-valued input x and one real-valued output y. There are seven training points.

Suppose you are training using kernel regression with some unspecified kernel function. The only thing you know about the kernel function is that it is a monotonically decreasing function of distance that decays to zero at a distance of 3 units (and is strictly greater than zero at a distance of less than 3 units).

(a) (2 pts) What is the predicted value of y when x = 1?
Answer: (1 + 2 + 5 + 6) / 4 = 3.5

(b) (2 pts) What is the predicted value of y when x = 3?
Answer: (1 + 2 + 5 + 6 + 1 + 5 + 6) / 7 = 26/7

(c) (2 pts) What is the predicted value of y when x = 4?
Answer: (1 + 5 + 6) / 3 = 4

(d) (2 pts) What is the predicted value of y when x = 7?
Answer: (1 + 5 + 6) / 3 = 4
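Although the figure with the seven training points is not reproduced here, the prediction rule itself is easy to sketch. The Nadaraya-Watson example below uses a kernel that is positive inside a radius of 3 and exactly zero beyond it, so each prediction is a weighted average of only the training outputs within that radius. The training points are hypothetical stand-ins, so it illustrates the mechanism rather than reproducing the exam's exact numbers.

```python
import numpy as np

# Hypothetical 1-D training set (the exam's figure is not reproduced here).
x_train = np.array([0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0])
y_train = np.array([2.0, 1.0, 5.0, 6.0, 1.0, 5.0, 6.0])

def kernel(d, radius=3.0):
    """Monotonically decreasing in distance, exactly zero beyond `radius`."""
    return np.maximum(0.0, 1.0 - d / radius)

def predict(x0):
    w = kernel(np.abs(x_train - x0))
    return np.dot(w, y_train) / w.sum()   # kernel-weighted average of in-range y's

for q in (1.0, 3.0, 4.0, 7.0):
    print(q, predict(q))
```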

Problem 5. HMM (12 points)

Consider the HMM defined by the transition and emission probabilities in the table below. This HMM has six states (plus start and end states) and an alphabet with four symbols (A, C, G and T). Thus, the probability of transitioning from state S_1 to state S_2 is 1, and the probability of emitting A while in state S_1 is 0.3. Here is the state diagram:

For each of the pairs below, place <, > or = between the right and left components of each pair. (2 pts each)

(a) P(O_1=A, O_2=C, O_3=T, O_4=A, q_1=S_1, q_2=S_2)  vs  P(O_1=A, O_2=C, O_3=T, O_4=A | q_1=S_1, q_2=S_2)

Below we will use a shortened notation. Specifically, we will use P(A, C, T, A, S_1, S_2) instead of P(O_1=A, O_2=C, O_3=T, O_4=A, q_1=S_1, q_2=S_2), P(A, C, T, A) instead of P(O_1=A, O_2=C, O_3=T, O_4=A), and so forth.

Answer: =
P(A, C, T, A, S_1, S_2) = P(A, C, T, A | S_1, S_2) P(S_1, S_2) = P(A, C, T, A | S_1, S_2), since P(S_1, S_2) = 1.

(b) P(O_1=A, O_2=C, O_3=T, O_4=A, q_3=S_3, q_4=S_4)  vs  P(O_1=A, O_2=C, O_3=T, O_4=A | q_3=S_3, q_4=S_4)

Answer: <
As in (a), P(A, C, T, A, S_3, S_4) = P(A, C, T, A | S_3, S_4) P(S_3, S_4); however, since P(S_3, S_4) = 0.3, the right hand side is bigger.

(c) P(O_1=A, O_2=C, O_3=T, O_4=A, q_3=S_3, q_4=S_4)  vs  P(O_1=A, O_2=C, O_3=T, O_4=A, q_3=S_5, q_4=S_6)

Answer: <
The first two emissions (A and C) do not matter since they are the same. Thus, the left hand side reduces to P(O_3=T, O_4=A, q_3=S_3, q_4=S_4) = P(O_3=T, O_4=A | q_3=S_3, q_4=S_4) P(S_3, S_4) = 0.7 · 0.1 · 0.3 = 0.021, while the right hand side is 0.3 · 0.2 · 0.7 = 0.042.

(d) P(O_1=A, O_2=C, O_3=T, O_4=A)  vs  P(O_1=A, O_2=C, O_3=T, O_4=A, q_3=S_3, q_4=S_4)

Answer: >
Here the left hand side is P(A, C, T, A, S_3, S_4) + P(A, C, T, A, S_5, S_6). One term of this sum is exactly the right hand side; since the other term is greater than 0, the left hand side is greater.

(e) P(O_1=A, O_2=C, O_3=T, O_4=A)  vs  P(O_1=A, O_2=C, O_3=T, O_4=A | q_3=S_3, q_4=S_4)

Answer: <
As mentioned for (d), the left hand side is P(A, C, T, A, S_3, S_4) + P(A, C, T, A, S_5, S_6) = P(A, C, T, A | S_3, S_4) P(S_3, S_4) + P(A, C, T, A | S_5, S_6) P(S_5, S_6). Since P(A, C, T, A | S_3, S_4) > P(A, C, T, A | S_5, S_6), this mixture is lower than the right hand side.

(f) P(O_1=A, O_2=C, O_3=T, O_4=A)  vs  P(O_1=A, O_2=T, O_3=T, O_4=G)

Answer: <
Since the first and third letters are the same, we only need to worry about the second and fourth. The left hand side is 0.1 · (0.3 · 0.1 + 0.7 · 0.2) = 0.017, while the right hand side is 0.6 · (0.7 · 0 + 0.3 · 0.4) = 0.072.
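The relation these comparisons rely on, P(O, q) = P(O | q) · P(q), can be spelled out with a toy HMM. The transition and emission numbers below are placeholders (the exam's table is not reproduced here), chosen only to show that multiplying the conditional by a path probability less than 1 shrinks the joint relative to the conditional.

```python
import numpy as np

# Placeholder parameters for a 2-state, 4-symbol (A, C, G, T) toy HMM.
start = {"S1": 0.7, "S2": 0.3}
trans = {("S1", "S1"): 0.0, ("S1", "S2"): 1.0,
         ("S2", "S1"): 0.0, ("S2", "S2"): 1.0}
emit = {"S1": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        "S2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

def path_prob(path):
    """P(q): start probability times the transition probabilities along the path."""
    p = start[path[0]]
    for s, t in zip(path, path[1:]):
        p *= trans[(s, t)]
    return p

def cond_obs_prob(obs, path):
    """P(O | q): product of emission probabilities along the path."""
    return np.prod([emit[s][o] for s, o in zip(path, obs)])

obs, path = ["A", "C"], ["S1", "S2"]
joint = cond_obs_prob(obs, path) * path_prob(path)   # P(O, q) = P(O | q) P(q)
print(cond_obs_prob(obs, path), path_prob(path), joint)  # joint < conditional here
```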

Problem 6. Learning from labeled and unlabeled data (10 points)

Consider the following figure, which contains labeled (class 1 black circles, class 2 hollow circles) and unlabeled (squares) data. We would like to use two methods discussed in class (re-weighting and co-training) in order to utilize the unlabeled data when training a Gaussian classifier.

(a) (2 pts) How can we use co-training in this case (what are the two classifiers)?
Answer: Co-training partitions the feature space into two separate sets and uses these sets to construct independent classifiers. Here, the most natural way is to use one classifier (a Gaussian) for the x axis and a second (another Gaussian) for the y axis.

(b) We would like to use re-weighting of unlabeled data to improve the classification performance. Re-weighting will be done by placing the dashed circle on each of the labeled data points and counting the number of unlabeled data points in that circle. Next, a Gaussian classifier is run with the new weights computed.

(b.1) (2 pts) To what class (hollow circles or full circles) would we assign the unlabeled point A if we were training a Gaussian classifier using only the labeled data points (with no re-weighting)?
Answer: Hollow class. Note that the hollow points are much more spread out and so the Gaussian learned for them will have a higher variance.

(b.2) (2 pts) To what class (hollow circles or full circles) would we assign the unlabeled point A if we were training a classifier using the re-weighting procedure described above?
Answer: Again, the hollow class. Re-weighting will not change the result since it will be done independently for each of the two classes, and will produce very similar class centers to the ones in (b.1) above.

(c) (4 pts) When we handle a polynomial regression problem, we would like to decide what degree of polynomial to use in order to fit a test set. The table below describes the disagreement between the different polynomials on unlabeled data and also the disagreement with the labeled data. Based on the method presented in class, which polynomial should we choose for this data? Which of the two tables do you prefer?

Answer: The degree we would select is 3. Based on the classification accuracy, it is beneficial to use higher degree polynomials. However, as we said in class, these might overfit. One way to test whether they do or do not is to check consistency on unlabeled data by requiring that the triangle inequality hold for the selected degree. For a third degree polynomial this is indeed the case, since u(2, 3) = 0.2 ≤ l(2) + l(3) = 0.2 + 0.1 (where u(2, 3) is the disagreement between the second and third degree polynomials on the unlabeled data and l(2) is the disagreement between degree 2 and the labeled data). Similarly, u(1, 3) = 0.5 ≤ l(1) + l(3) = 0.4 + 0.1. In contrast, this does not hold for a fourth degree polynomial, since u(3, 4) = 0.2 > l(3) + l(4) = 0.1.
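The consistency check described in the answer can be written out as a tiny rule. The disagreement values below are just the ones quoted in the answer (the exam's full table is not reproduced, so l(4) is a placeholder consistent with l(3) + l(4) = 0.1).

```python
# Disagreements quoted in the answer: u[(i, j)] on unlabeled data,
# l[i] against the labeled data.
u = {(2, 3): 0.2, (1, 3): 0.5, (3, 4): 0.2}
l = {1: 0.4, 2: 0.2, 3: 0.1, 4: 0.0}   # l[4] is a placeholder value

def consistent(i, j):
    """Triangle-inequality test: u(i, j) <= l(i) + l(j)."""
    return u[(i, j)] <= l[i] + l[j]

print(consistent(2, 3), consistent(1, 3))   # True, True  -> degree 3 is acceptable
print(consistent(3, 4))                     # False       -> degree 4 overfits
```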

Problem 7. Bayes Net Inference (10 points)

For (a) through (c), compute the following probabilities from the Bayes net below. Hint: These examples have been designed so that none of the calculations should take you longer than a few minutes. If you find yourself doing dozens of calculations on a question, sit back and look for shortcuts. This can be done easily if you notice a certain special property of the numbers on this diagram.

(a) (2 pts) P(A | B) =
Answer: 3/8. P(A | B) = P(A ∧ B)/P(B) = P(B | A) P(A) / (P(B | A) P(A) + P(B | ¬A) P(¬A)) = 0.21/(0.21 + 0.35) = 3/8.

(b) (2 pts) P(B | D) =
Answer: 0.56. P(D | C) = P(D | ¬C), so D is independent of C and is not influencing the Bayes net. So P(B | D) = P(B), which we calculated in (a) to be 0.56.

(c) (2 pts) P(C | B) =
Answer: 5/11. P(C | B) = (P(A ∧ B ∧ C) + P(¬A ∧ B ∧ C))/P(B) = (P(A) P(B | A) P(C | A) + P(¬A) P(B | ¬A) P(C | ¬A))/P(B) = (0.042 + 0.21)/0.56 = 5/11.
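Part (a) is just Bayes rule over the two cases of A. The sketch below uses hypothetical CPT entries chosen only so the two products match the 0.21 and 0.35 quoted in the answer (the actual network diagram is not reproduced here).

```python
# Hypothetical CPT entries: P(A)=0.3, P(B|A)=0.7, P(B|~A)=0.5, chosen only so
# that P(B|A)P(A)=0.21 and P(B|~A)P(~A)=0.35, as quoted in the answer.
p_a = 0.3
p_b_given_a = 0.7
p_b_given_not_a = 0.5

num = p_b_given_a * p_a                     # P(B ^ A)  = 0.21
den = num + p_b_given_not_a * (1 - p_a)     # P(B)      = 0.21 + 0.35 = 0.56
print(num / den)                            # P(A | B)  = 3/8
```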

For (d) through (g), indicate whether the given statement is TRUE or FALSE in the Bayes net given below.

(d) [T/F - (1 pt)] I<A, {}, E>
Answer: TRUE.

(e) [T/F - (1 pt)] I<A, G, E>
Answer: FALSE.

(f) [T/F - (1 pt)] I<C, {A, G}, F>
Answer: FALSE.

(g) [T/F - (1 pt)] I<B, {C, E}, F>
Answer: FALSE.

Problem 8. Bayes Nets II (12 points)

(a) (4 points) Suppose we use a naive Bayes classifier to learn a classifier for y = A ∧ B, where A and B are Boolean random variables, independent of each other, with P(A) = 0.4 and P(B) = 0.5. Draw the Bayes net that represents the independence assumptions of our classifier and fill in the probability tables for the net.

Answer: In computing the probabilities for the Bayes net we use the following Boolean table with the corresponding probability for each row:

A  B  y  P
0  0  0  0.6*0.5 = 0.3
0  1  0  0.6*0.5 = 0.3
1  0  0  0.4*0.5 = 0.2
1  1  1  0.4*0.5 = 0.2

Using the table we can compute the probabilities for the Bayes net:
P(y) = 0.2
P(B | y) = P(B, y)/P(y) = 1
P(B | ¬y) = P(B, ¬y)/P(¬y) = 0.3/0.8 = 0.375
P(A | y) = P(A, y)/P(y) = 1
P(A | ¬y) = P(A, ¬y)/P(¬y) = 0.2/0.8 = 0.25
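The joint table and the CPT entries can be generated and read off directly. The short sketch below (a plain-Python illustration of the same enumeration) reproduces the numbers above.

```python
from itertools import product

p_a, p_b = 0.4, 0.5

# Joint probability of every (A, B) assignment, with y = A AND B.
rows = []
for a, b in product((0, 1), repeat=2):
    p = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
    rows.append((a, b, a & b, p))

p_y = sum(p for a, b, y, p in rows if y == 1)                               # P(y)    = 0.2
p_a_given_y = sum(p for a, b, y, p in rows if y and a) / p_y                # P(A|y)  = 1.0
p_a_given_not_y = sum(p for a, b, y, p in rows if not y and a) / (1 - p_y)  # P(A|~y) = 0.25
p_b_given_not_y = sum(p for a, b, y, p in rows if not y and b) / (1 - p_y)  # P(B|~y) = 0.375
print(p_y, p_a_given_y, p_a_given_not_y, p_b_given_not_y)
```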

(b) (8 points) Consider a robot operating in the two-cell gridworld shown below. Suppose the robot is initially in cell C_1. At any point in time the robot can execute either of two actions: A_1 and A_2. A_1 is to move to a neighboring cell. If the robot is in C_1 then the action A_1 succeeds (moves the robot into C_2) with probability 0.9 and fails (leaves the robot in C_1) with probability 0.1. If the robot is in C_2 then the action A_1 succeeds (moves the robot into C_1) with probability 0.8 and fails (leaves the robot in C_2) with probability 0.2. The action A_2 is to stay in the same cell, and when executed it keeps the robot in the same cell with probability 1. The first action the robot executes is chosen at random (with equal probability between A_1 and A_2). Afterwards, the robot alternates the actions it executes (for example, if the robot executed action A_1 first, then the sequence of actions is A_1, A_2, A_1, A_2, ...). Answer the following questions.

(b.1) (4 points) Draw the Bayes net that represents the cell the robot is in during the first two actions the robot executes (e.g., the initial cell, the cell after the first action, and the cell after the second action) and fill in the probability tables. (Hint: The Bayes net should have five variables: q_1 - the initial cell; q_2, q_3 - the cell after the first and the second action, respectively; a_1, a_2 - the first and the second action, respectively.)

Answer:

(b.2) (4 points) Suppose you were told that the first action the robot executes is A_1. What is the probability that the robot will appear in cell C_1 after it executes close to infinitely many actions?

Answer: Since actions alternate and the first action is A_1, the transition matrix for any odd action is:

P(a_odd) = [ 0.1  0.9 ]
           [ 0.8  0.2 ]

where the p_ij element is the probability of transitioning into cell j as a result of an execution of an odd action, given that the robot is in cell i before executing this action. Similarly, the transition matrix for any even action is:

P(a_even) = [ 1  0 ]
            [ 0  1 ]

If we consider the pair of actions as one meta-action, then we have a Markov chain with the transition probability matrix:

P = P(a_odd) · P(a_even) = [ 0.1  0.9 ]
                           [ 0.8  0.2 ]

At t = ∞, the state distribution satisfies P(q_t) = P^T P(q_t). So, P(q_t = C_1) = 0.1 · P(q_{t-1} = C_1) + 0.8 · P(q_{t-1} = C_2). Since there are only two cells possible we have: P(q_t = C_1) = 0.1 · P(q_{t-1} = C_1) + 0.8 · (1 − P(q_{t-1} = C_1)). Solving for P(q_t = C_1) we get: P(q_t = C_1) = 0.8/1.7 = 0.4706.
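The limiting probability can be checked by iterating the two-step (odd action followed by even action) transition matrix; a minimal NumPy sketch:

```python
import numpy as np

P_odd = np.array([[0.1, 0.9],    # action A1 from C1 / C2
                  [0.8, 0.2]])
P_even = np.eye(2)               # action A2 always stays put
P = P_odd @ P_even               # one odd/even pair of actions

dist = np.array([1.0, 0.0])      # robot starts in C1
for _ in range(1000):            # distribution after many action pairs
    dist = dist @ P
print(dist[0], 0.8 / 1.7)        # both ~0.4706
```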

Problem 9. Markov Decision Processes (11 pts)

(a) (8 points) Consider the MDP given in the figure below. Assume the discount factor γ = 0.9. The r-values are rewards, while the numbers next to arrows are probabilities of outcomes. Note that only state S_1 has two actions. The other states have only one action each.

(a.1) (4 points) Write down the numerical value of J(S_1) after the first and the second iterations of Value Iteration.

Initial value function: J_0(S_0) = 0; J_0(S_1) = 0; J_0(S_2) = 0; J_0(S_3) = 0.
J_1(S_1) = ?
J_2(S_1) = ?

Answer:
J_1(S_1) = 2
J_2(S_1) = max(2 + 0.9 · (0.5 · J_1(S_1) + 0.5 · J_1(S_3)), 2 + 0.9 · J_1(S_2)) = max(2 + 0.9 · (0.5 · 2 + 0.5 · 10), 2 + 0.9 · 3) = 7.4

(a.2) (4 points) Write down the optimal value of state S_1. There are a few ways to solve it, and for one of them you may find useful the following equality: Σ_{i=0}^∞ α^i = 1/(1 − α) for any 0 ≤ α < 1.

J*(S_1) = ?

Answer: It is pretty clear from the given MDP that the optimal policy from S_1 will involve trying to move from S_1 to S_3, as this is the only state that has a large reward. First, we compute the optimal value for S_3:
J*(S_3) = 10 + 0.9 · J*(S_3), so J*(S_3) = 10/0.1 = 100.
We can now compute the optimal value for S_1:
J*(S_1) = 2 + 0.9 · (0.5 · J*(S_1) + 0.5 · J*(S_3)) = 2 + 0.9 · (0.5 · J*(S_1) + 50).
Solving for J*(S_1) we get: J*(S_1) = 47/0.55 ≈ 85.45.
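The two fixed-point equations for J*(S_3) and J*(S_1) can also be reached by iterating the Bellman backups used in the answer. This is a minimal sketch that only models those two backups, not the full MDP (whose diagram is not reproduced here).

```python
gamma = 0.9
J1, J3 = 0.0, 0.0   # values of S1 and S3, initialized to zero

# Iterate the Bellman backups used in the answer until convergence.
for _ in range(1000):
    J3 = 10 + gamma * J3                      # S3: reward 10, self-loop
    J1 = 2 + gamma * (0.5 * J1 + 0.5 * J3)    # S1: optimal action toward S3

print(J3, J1)   # -> 100.0 and 47/0.55 ~= 85.45
```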

(b) (3 points) A general MDP with N states is guaranteed to converge in the limit for Value Iteration as long as γ < 1. In practice one cannot perform infinitely many value iterations to guarantee convergence. Circle all the statements below that are true.

(1) Any MDP with N states converges after N value iterations for γ = 0.5.
Answer: False

(2) Any MDP converges after the 1st value iteration for γ = 1.
Answer: False

(3) Any MDP converges after the 1st value iteration for a discount factor γ = 0.
Answer: True, since all the converged values will be just immediate rewards.

(4) An acyclic MDP with N states converges after N value iterations for any 0 ≤ γ ≤ 1.
Answer: True, since there are no cycles and therefore after each iteration at least one state whose value was not optimal before is guaranteed to have its value set to an optimal value (even when γ = 1), unless all state values have already converged.

(5) An MDP with N states and no stochastic actions (that is, each action has only one outcome) converges after N value iterations for any 0 ≤ γ < 1.
Answer: False. Consider a situation where there are no absorbing goal states.

(6) One usually stops value iterations after iteration k+1 if: max_{0 ≤ i ≤ N−1} |J_{k+1}(S_i) − J_k(S_i)| < ξ, for some small constant ξ > 0.
Answer: True.