ECE 275B Homework #2
Due Thursday 2-16-12
MIDTERM is Scheduled for Tuesday, February 21, 2012

Reading

Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example 7.11, pp. 178-182. Also read and understand the material in Section 7.8 of Kay. Read and understand the Expectation Maximization (EM) Algorithm described in Moon and Stirling, Sections 17.1-17.6.

The following papers are highly recommended:

- Eliminating Multiple Root Problems in Estimation, C.G. Small, J. Wang, and Z. Yang, Statistical Science, Vol. 15, No. 4 (Nov. 2000), pp. 313-332.

- Maximum Likelihood from Incomplete Data via the EM Algorithm, A.P. Dempster, N.M. Laird, and D.B. Rubin, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 1-38. Highly Recommended.

- On the Convergence Properties of the EM Algorithm, C.F. Jeff Wu, The Annals of Statistics, Vol. 11, No. 1 (Mar. 1983), pp. 95-103.

- On the Convergence of the EM Algorithm, R.A. Boyles, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 45, No. 1 (1983), pp. 47-50.

- Mixture Densities, Maximum Likelihood and the EM Algorithm, R.A. Redner and H.F. Walker, SIAM Review, Vol. 26, No. 2 (Apr. 1984), pp. 195-239. Highly Recommended.

- Direct Calculation of the Information Matrix via the EM Algorithm, D. Oakes, Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 61, No. 2 (1999), pp. 479-482.

- The Variational Approximation for Bayesian Inference: Life After the EM Algorithm, D.G. Tzikas, A.C. Likas, and N.P. Galatsanos, IEEE Signal Processing Magazine, November 2008, pp. 131-146.

Comments on the Above Cited Papers

As mentioned in class, for highly nonlinear problems of the type considered by Kay and for parametric mixture distributions [1] there will generally be multiple roots of the likelihood equation, corresponding to multiple local maxima of the likelihood function. The question of how to deal with this problem, and in particular the question of how to track the zero of the score function having the desired asymptotic property of efficiency (which generally does not correspond to the global maximum of the likelihood function), is a difficult one of keen interest. In this regard, you might want to read the paper Eliminating Multiple Root Problems in Estimation cited above.

[1] In both cases we are leaving the nice world of regular exponential family distributions.

Note from your readings in Kay that the Newton-Raphson procedure uses the Sample Information Matrix (SIM), while the Method of Scores uses the actual Fisher Information Matrix (FIM), which is the expected value of the SIM. As discussed in Kay, in the limit of an infinitely large number of samples it is the case that FIM ≈ SIM. Thus a perceived merit of the Newton-like numerical approaches is that an asymptotically valid estimate of the parameter error covariance matrix is provided by the availability of the SIM. [2]

[2] Recall that for regular statistical families the MLE asymptotically attains the C-R lower bound, which is the inverse of the FIM. The availability of the error covariance matrix, coupled with the fact that the MLE is asymptotically normal, allows one to easily construct confidence intervals (error bars) about the estimated parameter values.
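
As a point of reference for the preceding comparison, the two iterations can be sketched as follows for a scalar parameter; this is a hedged summary in my own notation (with $s(\theta)$ the score, i.e., the derivative of the log-likelihood), not a quotation of Kay's equations. Newton-Raphson uses the observed (sample) information, while the Method of Scores replaces it with its expectation, the FIM $I(\theta)$:

$\theta^{(k+1)} = \theta^{(k)} + \left[-\frac{\partial^2 \ln p(\mathbf{x};\theta)}{\partial \theta^2}\bigg|_{\theta^{(k)}}\right]^{-1} s\big(\theta^{(k)}\big)$  (Newton-Raphson)

$\theta^{(k+1)} = \theta^{(k)} + I^{-1}\big(\theta^{(k)}\big)\, s\big(\theta^{(k)}\big), \qquad s(\theta) = \frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}$  (Method of Scores)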

Important and useful background material on the history, applications, and convergence properties of the EM algorithm as a procedure for performing missing-data maximum likelihood estimation can be found in the classic paper by Dempster, Laird, and Rubin (DLR) cited above. The DLR paper discusses a variety of applications, including mixture-density estimation and the example given in Moon. In the DLR paper the complete data space is not a simple cartesian product of the actual data and hidden data spaces.

The subsequent papers by Wu and Boyles, both cited above, discuss the lack of validity of the convergence proof given in the original DLR paper. Boyles gives an illuminating counterexample which contradicts the convergence claim made in DLR. Wu gives a rigorous and correct proof of convergence for the EM algorithm, and states sufficient conditions for convergence.

Also cited above is the excellent and highly recommended SIAM Review paper by Redner and Walker on applying the EM algorithm to the problem of mixture density estimation. Despite the existence of excellent books now available on the subject of mixture density estimation, this classic paper is still well worth reading. The use of mixture densities to model arbitrary density functions, and classification when class-labeled training data are not available, are two immensely important uses of the EM algorithm. The convergence conditions given by DLR and Wu are clearly important for mixture density estimation. While mixtures of regular exponential family distributions (REFDs) result in convergent behavior, this is not true when one or more of the component densities of a mixture are not from an REFD. For example, a mixture of a uniform distribution and a gaussian distribution will not result in a convergent EM algorithm, as you will find in one of the homework problems below.

One claimed potential drawback to the basic EM algorithm is the general lack of the availability of the SIM or FIM for use as a parameter error covariance matrix. [3] In fact, a careful reading of DLR shows that the SIM can be readily computed. This fact is discussed and made salient in the above-cited paper Direct Calculation of the Information Matrix via the EM Algorithm, which discusses the additional computations (over and above the basic EM algorithm computations) needed to compute an estimate of the information matrix.

[3] On the one hand, the ability to avoid the explicit computation of the FIM (usually required to be updated at every iteration step) typically provides a huge computational savings relative to the Newton algorithm. On the other hand, having this matrix available once the algorithm has finally converged to an estimate of the unknown parameters allows one to construct confidence intervals (parameter error bars) for these parameters.

Finally, the recent review paper by Tzikas et al. on the relationship between the EM algorithm and variational approximation is a nice, short synopsis of material to be found in the textbook Pattern Recognition and Machine Learning, C. Bishop, Springer, 2006. The derivation of the EM algorithm given in this paper is slightly simpler than the one I give in lecture because (unlike DLR) they assume that the complete data space is the cartesian product of the actual data and hidden data spaces (Bishop makes this same simplifying assumption).
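
For orientation, the Bishop/Tzikas-style derivation rests on a standard decomposition which can be sketched as follows (my notation, assuming observed data $y$, hidden data $z$, and an arbitrary distribution $q(z)$ over the hidden data):

$\ln p(y \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid y, \theta)\big), \qquad \mathcal{L}(q, \theta) = \int q(z) \ln \frac{p(y, z \mid \theta)}{q(z)}\, dz .$

Since the KL term is nonnegative, $\mathcal{L}(q,\theta)$ lower-bounds the actual-data log-likelihood. The E-step sets $q(z) = p(z \mid y, \theta^{(k)})$, which makes the bound tight at the current iterate, and the M-step maximizes $\mathcal{L}(q,\theta)$ over $\theta$, i.e., maximizes $E_q\{\ln p(y, z \mid \theta)\}$; together these two steps guarantee monotonic hill-climbing on $\ln p(y \mid \theta)$.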

Homework Problems and Matlab Programming Exercises

1. Let the scalar random variables $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ be all mutually independent, where $Y_i \sim \mathrm{Poisson}(\beta \tau_i)$ and $X_i \sim \mathrm{Poisson}(\tau_i)$, $i = 1, \ldots, n$. The two data sets $\{X_i\}_{i=1}^n$ and $\{Y_i\}_{i=1}^n$ taken together comprise the complete data. The parameters $\beta$ and $\tau_i$, $i = 1, \ldots, n$, are deterministic and unknown. [4] This provides a simple model of the incidence of a disease $Y_i$ seen in hospital $i$, where the underlying rate is a function of an overall (hospital-independent) effect $\beta$ due to the intrinsic virulence of the disease and an additional hospital-specific factor $\tau_i$. The availability of the measurements $X_i$ allows one to directly estimate the hospital-specific factor $\tau_i$. [5]

(a) Find the complete information maximum likelihood estimate (MLE) of the unknown parameters.

(b) Assume that the observation for $X_1$ is missing. (Thus each hospital $i$, except for hospital 1, is able to provide a measurement of the quantity $X_i$.) Determine an EM algorithm for finding the actual information MLE of the unknown parameters.

[4] Recall that a $\mathrm{Poisson}(\lambda)$ distribution has the form $p_Z(z \mid \lambda) = e^{-\lambda} \lambda^z / z!$, $z = 0, 1, \ldots$, with mean $E\{z\} = \lambda$.

[5] For example, the parameter $\tau_i$ could depend upon the size and overall health of the population in the region served by hospital $i$ and $X_i$ could be the waiting time in minutes before an entering patient is first examined by a trained medical professional. Or $\tau_i$ could depend on the quality of the health care provided by hospital $i$ and $X_i$ could be the number of nurses with advanced training. Et cetera.

2. Let $A = [a_{ij}]$ be an arbitrary real $m \times n$ matrix with $ij$-th component $a_{ij}$. For $f$ a differentiable scalar real-valued function of $A$, we define the matrix derivative and the matrix gradient of $f(A)$ in a component-wise manner respectively as

$\left[\frac{\partial f(A)}{\partial A}\right]_{ij} \triangleq \frac{\partial f(A)}{\partial a_{ji}} \qquad\text{and}\qquad \nabla_A f(A) \triangleq \left(\frac{\partial f(A)}{\partial A}\right)^T, \quad\text{i.e.,}\quad \left[\nabla_A f(A)\right]_{ij} = \frac{\partial f(A)}{\partial a_{ij}} .$

Note that this convention is consistent with the convention used to define the (covariant) vector derivative and (contravariant cartesian) gradient in ECE 275A. For $\Sigma$ square and invertible (but not necessarily symmetric) prove that

$\frac{\partial}{\partial \Sigma} \log \det \Sigma = \Sigma^{-1} \qquad\text{and}\qquad \frac{\partial}{\partial \Sigma} \operatorname{tr}\big(\Sigma^{-1} W\big) = -\Sigma^{-1} W \Sigma^{-1} .$

Note that the condition for the first identity to be well-defined is that $\det \Sigma > 0$. Assuming only that $\det \Sigma \neq 0$, what is $\frac{\partial}{\partial \Sigma} \log \left|\det \Sigma\right|$?
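
As a quick sanity check of the convention above (my own example, not part of the assigned problem), take $f(A) = \operatorname{tr}(BA)$ for a fixed real $n \times m$ matrix $B$. Then

$\left[\frac{\partial f(A)}{\partial A}\right]_{ij} = \frac{\partial}{\partial a_{ji}} \sum_{k,l} b_{kl}\, a_{lk} = b_{ij}, \qquad\text{so}\qquad \frac{\partial f(A)}{\partial A} = B \quad\text{and}\quad \nabla_A f(A) = B^T .$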

matrix gradient of f(a) in a component-wise manner respectively as [ ] ( ) T [ ] f(a) f(a) ; A f(a) A a ji A f(a) = f(a) a ij Note that this convention is consistent with the convention used to define the (covariant) vector derivative and (contravariant cartesian) gradient in ECE275A. For Σ square and invertible (but not necessarily symmetric) prove that Σ log det Σ = Σ 1 and Σ tr Σ 1 W = Σ 1 W Σ 1. Note that the condition for the first identity to be well-defined is that det Σ > 0. Assuming only that det Σ 0, what is log det Σ? Σ 3. Let y be a gaussian random vector with unknown mean µ and unknown covariance matrix Σ which is assumed to be full rank. Given a collection of N iid samples of y use the result of the previous homework problem to find the maximum likelihood estimates of µ and Σ. 6 Why is just determining a stationary point of the log likelihood function generally sufficient to claim uniqueness and optimality of the solution? 4. (i) In order to model the pdf of a scalar random variable, y, completely set up the M component Gaussian mixture density problem as a hidden data problem and then derive the EM algorithm for identifying the unknown mixture parameters. Assume that the individual mixture parameters are all independent. (ii) Write a program (in Matlab, or your favorite programming language) to implement your algorithm as requested in the problems given below. 5. Note that the algorithm derived in problem 4 above requires no derivatives, hessians, or step-size parameters in its implementation. Is this true for a complete implementation of the EM solution given by Kay as equations (7.58) and (7.59)? Why? Can one always avoid derivatives when using the EM algorithm? 6. On pages 290, 291, 293, and 294 (culminating in Example 9.3), Kay shows how to determine consistent estimates for the parameters of a zero mean, unimodal gaussian mixture. What utility could these parameters have in determining MLE s of the parameters via numerical means? 6 Do NOT use the gradient formulas for matrices constrained to be symmetric as is done by Moon and Stirling on page 558. Although the symmetrically constrained derivative formulas are sometimes useful, they are not needed here because the stationary point of the likelihood just happens to turn out to be symmetric (a happy accident!) without the need to impose this constraint. This means that the ( accidently symmetric) solution is optimal in the space of all matrices, not just on the manifold of symmetric matrices. I have never seen the symmetry constraint applied when taking the derivative except in Moon and Stirling. 4

5. Note that the algorithm derived in Problem 4 above requires no derivatives, hessians, or step-size parameters in its implementation. Is this true for a complete implementation of the EM solution given by Kay as equations (7.58) and (7.59)? Why? Can one always avoid derivatives when using the EM algorithm?

6. On pages 290, 291, 293, and 294 (culminating in Example 9.3), Kay shows how to determine consistent estimates for the parameters of a zero mean, unimodal gaussian mixture. What utility could these parameters have in determining MLEs of the parameters via numerical means?

7. Consider a two-gaussian mixture density $p(y \mid \Theta)$, with true mixture parameters $\alpha = \alpha_1 = 0.6$, $\bar\alpha = \alpha_2 = 1 - \alpha = 0.4$, means $\mu_1 = 0$, $\mu_2 = 10$, and variances $\sigma_1^2$ and $\sigma_2^2$. Thus $\Theta = \{\alpha, \bar\alpha, \theta\}$, $\theta = \{\mu_1, \mu_2, \sigma_1^2, \sigma_2^2\}$. The computer experiments described below are to be performed for the two cases $\sigma_1^2 = \sigma_2^2 = 1$ (i.e., both component standard deviations equal to 1) and $\sigma_1^2 = \sigma_2^2 = 25$ (i.e., both component standard deviations equal to 5). Note that in the first case the means of the two component densities are located 10 standard deviations apart (probably a rarely seen situation in practice) whereas in the second case the means of the two component densities are located 2 standard deviations apart (probably a not so rare situation).

(a) Plot the true pdf $p(y \mid \Theta)$ as a function of $y$. Generate and save 400 iid samples from the pdf. Show frequency histograms of the data for 25 samples, 50 samples, 100 samples, 200 samples, and 400 samples, plotted on the same plot as the pdf.

(b) (i) Derive an analytic expression for the mean $\mu$ and variance $\sigma^2$ of $y$ as a function of the mixture parameters, and evaluate this expression using the true parameter values. (ii) Compute the sample mean $\hat\mu$ and sample variance $\hat\sigma^2$ of $y$ using the 400 data samples and compare to the analytically determined true values. (iii) Show that the sample mean and sample variance are the ML estimates for the mean and variance of a (single) gaussian model for the data. Plot this estimated single gaussian model against the true two-gaussian mixture density.

(c) (i) Assume that the true density is unknown and estimate the parameters for a two-component gaussian mixture [7] using the EM algorithm starting from the initial conditions

$\mu_1 = \mu_2 = \hat\mu, \qquad \sigma_1^2 = \sigma_2^2 = \hat\sigma^2, \qquad \alpha = \tfrac{1}{2} .$

What happens and why? (ii) Perturb the initial conditions slightly to destroy their symmetry and describe the resulting behavior of the EM algorithm in words. (iii) Again using all 400 samples, estimate the parameters for a variety of initial conditions and try to trap the algorithm in a false local optimum. Describe the results. (iv) Once you have determined a good set of initial conditions, estimate the mixture parameters for the two-component model via the EM algorithm for 25, 50, 100, 200, and 400 samples. Show that the actual data log-likelihood function increases monotonically with each iteration of the EM algorithm. Plot your final estimated parameter values versus $\log_2(\text{sample number})$.

(d) Discuss the difference between the case where the component standard deviations both have the value 1 and the case where they both have the value 5.

[7] By visual inspection of the data frequency histograms, one can argue that whatever the unknown distribution is, it is smooth with bimodal symmetric humps and hence should be adequately modelled by a two-component gaussian mixture.
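
For Problem 7(a), data generation and the histogram/pdf overlay can be sketched as follows (a minimal Python/NumPy sketch; the variable names and plotting choices are mine):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
alpha, mu1, mu2, var1, var2 = 0.6, 0.0, 10.0, 1.0, 1.0   # case 1; set var1 = var2 = 25.0 for case 2
N = 400
from_comp1 = rng.random(N) < alpha                        # latent component label for each draw
y = np.where(from_comp1,
             rng.normal(mu1, np.sqrt(var1), N),
             rng.normal(mu2, np.sqrt(var2), N))

def mix_pdf(t):
    gauss = lambda t, m, v: np.exp(-0.5 * (t - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)
    return alpha * gauss(t, mu1, var1) + (1.0 - alpha) * gauss(t, mu2, var2)

t = np.linspace(y.min() - 3.0, y.max() + 3.0, 1000)
for n in (25, 50, 100, 200, 400):
    plt.figure()
    plt.hist(y[:n], bins=30, density=True, label=f"first {n} samples")
    plt.plot(t, mix_pdf(t), label="true pdf")
    plt.legend()
plt.show()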

8. (i) For the two-component gaussian mixture case derive the EM algorithm when the constraint $\sigma_1^2 = \sigma_2^2 = \sigma^2$ is known to hold. (I.e., incorporate this known constraint into the algorithm.) Compare the resulting algorithm to the one derived in Problem 4 above. (ii) Estimate the unknown parameters of a two-component gaussian mixture from the synthetic data generated in Problem 7 above using your modified algorithm and compare the results.

9. EM computer experiment when one of the components is non-regular. Consider a two-component mixture model where the first component is uniformly distributed between 0 and $\theta_1$, i.e., $U[0, \theta_1]$, with mixture parameter $\alpha = 0.55$, and the second component is normally distributed with mean $\mu$ and variance $\sigma^2$, $\theta_2 = (\mu, \sigma^2)^T$, with mixture parameter $\bar\alpha = 1 - \alpha = 0.45$. Assume that all of the parameters are independent of each other. Assume that $0 \log 0 = 0$.

(a) i. Generate 1000 samples from the distribution described above for $\theta_1 = 1$, $\mu = \tfrac{3}{2}$, and $\sigma^2 = \tfrac{1}{16}$. Plot a histogram of the data and the density function on the same graph.

ii. Now assume that you do not know how the samples were generated and learn the unknown distribution using 1-, 2-, 3-, 10-, and 50-component gaussian mixture models by training on the synthesized data. Plot the true mixture distribution and the 5 learned gaussian mixture models on the same graph. Recall that the MLE-based density estimate is asymptotically closest to the true density in the Kullback-Leibler divergence sense. Note that our learned models approximate an unknown non-regular mixture density by regular gaussian mixtures. Show that the actual data log-likelihood function increases monotonically with each iteration of the EM algorithm.

iii. Repeat the above for the values $\mu = \tfrac{1}{2}$, $1$, and $2$, with all other parameters unchanged.
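
For Problem 9(a)i, sampling from the uniform-plus-gaussian mixture can be sketched as follows (a short Python/NumPy sketch under the stated parameter values; the variable names are mine). The resulting samples can then be fed to the gaussian-mixture EM code from Problem 4(ii) for part ii.

import numpy as np

rng = np.random.default_rng(0)
alpha, theta1, mu, var = 0.55, 1.0, 1.5, 1.0 / 16.0   # Problem 9(a)i parameter values
N = 1000
from_uniform = rng.random(N) < alpha                   # latent component label for each draw
y = np.where(from_uniform,
             rng.uniform(0.0, theta1, N),
             rng.normal(mu, np.sqrt(var), N))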

(b) Suppose we now try to solve the non-regular two-component mixture problem described above by forming a uniform-plus-gaussian two-component mixture model and applying the EM algorithm in an attempt to learn the component parameters $\theta_1$, $\mu$, and $\sigma^2$ directly. Convince yourself that in setting up the EM algorithm the gaussian component can be handled as has already been done above, so that the novelty, and potential difficulty, is in dealing with the non-regular uniform distribution component.

i. In the M-step of the EM algorithm show that if $\hat x_{l,1} \neq 0$ for some sample $y_l > \theta_1$ then the sum

$\sum_{j=1}^N \hat x_{j,1} \log p_1(y_j; \theta_1),$

where $p_1(y_j; \theta_1)$ denotes the uniform distribution, is not bounded from below. What does this say about the admissible values of $\hat\theta_1^+$?

ii. Show that maximizing $\sum_{j=1}^N \hat x_{j,1} \log p_1(y_j; \theta_1)$ with respect to $\theta_1$ is equivalent to maximizing

$\left(\frac{1}{\theta_1}\right)^{\hat N_1} \prod_j 1_{[0,\theta_1]}(y_j) \qquad\text{with } \theta_1 \geq y_j \text{ for all } y_j > 0, \qquad (1)$

where $\{y_j\}$ are the data samples for which $\hat x_{j,1} \neq 0$, $\hat N_1 = \sum_{j=1}^N \hat x_{j,1}$, and $1_{[0,\theta_1]}(y_j)$ is the indicator function that indicates whether or not $y_j$ is in the closed interval $[0, \theta_1]$. Maximize the expression (1) to obtain the estimate $\hat\theta_1^+$. (Hint: Compare to the full information MLE solution you derived last quarter.) Explain why the estimate provided by the EM algorithm remains stuck at the value of the very first update $\hat\theta_1^+$, and never changes thereafter, and why the value one gets stuck at depends on the initialization value of $\hat\theta_1$. Verify these facts in simulation (see below for additional simulation requests). Try various initialization values for $\hat\theta_1$, including $\hat\theta_1 > \max_j y_j$. (Note that only initialization values that have an actual data log-likelihood that is bounded from below need be considered.) Show that the actual data log-likelihood function monotonically increases with each iteration of the EM algorithm (i.e., that we have monotonic hill-climbing on the actual data log-likelihood function).

iii. Let us heuristically attempt to fix the "stickiness" problem of the EM algorithm with an ad hoc stabilization of the algorithm. There is no guarantee that the procedure outlined below will work, as we are not conforming to the theory of the EM algorithm. Thus, monotonic hill-climbing on the actual data log-likelihood function is not guaranteed.

First put a floor on the value of $\hat x_{j,1}$ for positive $y_j$: if $\hat x_{j,1} < \epsilon$ and $y_j > 0$, set $\hat x_{j,1} = \epsilon$ and $\hat x_{j,2} = 1 - \epsilon$. Choose $\epsilon$ to be a very small value; in particular, choose $\epsilon \ll 1/\hat\theta_1$, where $\hat\theta_1 > \max_j y_j$ is the initialization value of the estimate of $\theta_1$.

Next define two intermediate estimates, $\tilde\theta_1$ and $\bar\theta_1$, of $\theta_1$ as follows:

$\tilde\theta_1 = \frac{2}{\hat N_1} \sum_{j=1}^N \hat x_{j,1}\, y_j \quad\text{with}\quad \hat N_1 = \sum_{j=1}^N \hat x_{j,1}$

and

$\bar\theta_1 = \max_j \; \hat x_{j,1}\, y_j\, e^{-\frac{(y_j - \tilde\theta_1)^2}{\tau}} .$

In the complete information case, which corresponds to setting $\hat x_{j,1} = x_{j,1}$ and $\hat N_1 = N_1$, the estimate $\tilde\theta_1$ corresponds to the BLUE (and method of moments) estimate of $\theta_1$. We can refer to $\tilde\theta_1$ as the pseudo-BLUE estimate. The estimate $\bar\theta_1$ modifies the EM update estimate by adding the factor $\hat x_{j,1}$, which weights $y_j$ according to the probability that it was generated by the uniform distribution, and an exponential factor which weights $y_j$ according to how far it is from the pseudo-BLUE estimate of $\theta_1$. We can refer to $\bar\theta_1$ as the weighted EM update of $\theta_1$. The parameter $\tau$ determines how aggressively we penalize the deviation of $y_j$ from $\tilde\theta_1$. We construct our final Modified EM update of $\theta_1$ as

$\hat\theta_1^+ = \beta\, \tilde\theta_1 + \bar\beta\, \bar\theta_1 \qquad\text{for } 0 \leq \beta \leq 1 \text{ and } \bar\beta = 1 - \beta .$

For example, setting $\beta = 1/2$ yields $\hat\theta_1^+ = \tfrac{1}{2}\big(\tilde\theta_1 + \bar\theta_1\big)$.

We see that there are three free parameters to be set when implementing the Modified EM update step, namely $\epsilon$, $\tau$, and $\beta$. Attempt to run the modified EM algorithm on 1000 samples generated from data for the cases $\theta_1 = 1$, $\sigma^2 = \tfrac{1}{16}$, and $\mu = \tfrac{1}{2}$, $1$, $\tfrac{3}{2}$, and $2$. Plot the actual data log-likelihood function to see if it fails to increase monotonically with each iteration of the EM algorithm. If you can't get the algorithm to work, don't worry; I thought up this heuristic fix on the fly and it definitely is not motivated by any rigorous mathematics. [8] If you can get the procedure to work, describe its performance relative to the pure gaussian mixture cases investigated above and show its performance for various choices of $\beta$, including $\beta = 0$, $0.5$, and $1$.

[8] I dreamed up this fix a few years ago. So far only one student has actually got it to work, and work surprisingly well! Of course I'm always curious as to how many students can get it to work every time I teach this course.
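
One literal transcription of the Modified EM update of $\theta_1$ into code is sketched below (Python/NumPy); it assumes the responsibilities $\hat x_{j,1}$ come from the usual E-step, takes $\beta$ to weight the pseudo-BLUE estimate as in the formula above, and treats $\epsilon$, $\tau$, and $\beta$ as the three free parameters. It is an illustration of the heuristic as stated, not a guaranteed-working implementation.

import numpy as np

def modified_em_theta1(y, xhat1, epsilon, tau, beta):
    """Heuristic 'Modified EM' update of theta_1 described above (illustrative sketch only)."""
    y = np.asarray(y, dtype=float)
    x1 = np.array(xhat1, dtype=float)          # E-step responsibilities for the uniform component
    # Floor the uniform-component responsibilities for positive samples
    # (the complementary responsibilities x_{j,2} would be set to 1 - epsilon).
    x1[(x1 < epsilon) & (y > 0.0)] = epsilon
    N1_hat = x1.sum()
    # Pseudo-BLUE (method-of-moments style) intermediate estimate of theta_1.
    theta_blue = (2.0 / N1_hat) * np.sum(x1 * y)
    # Weighted EM update: down-weight samples far from the pseudo-BLUE estimate.
    theta_wem = np.max(x1 * y * np.exp(-((y - theta_blue) ** 2) / tau))
    # Convex combination of the two intermediate estimates.
    return beta * theta_blue + (1.0 - beta) * theta_wem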