ECE 592 Topics in Data Science

Final Exam, Fall 2017
December 11, 2017

Please remember to justify your answers carefully, and to staple your test sheet and answers together before submitting.

Name:
Student ID:

Question 1 (Warm-up questions. [6 pts]) Below are a few quick questions that you should be able to answer quickly.

(a) [2 pts] Consider a graph G(V, E), where V are the vertices and E are the edges. Given that G is a forest comprised of k trees, how many edges does it have? (Hint: your answer should relate $|E|$ and $|V|$, the cardinalities of the sets E and V.)

Solution: The graph can be partitioned into k sub-graphs, $G_1(V_1, E_1), \ldots, G_k(V_k, E_k)$, where each sub-graph is a tree. Note that $\sum_{i=1}^{k} |V_i| = |V|$, because every vertex belongs to exactly one sub-graph. Moreover, because a tree with $|V_i|$ vertices has $|V_i| - 1$ edges, the number of edges is
$$|E| = \sum_{i=1}^{k} |E_i| = \sum_{i=1}^{k} \left( |V_i| - 1 \right) = |V| - k.$$

(b) [2 pts] Describe an application where hardware can acquire linear measurements in a compressive way.

Solution: Many students described a single-pixel camera, where light from the scene is reflected at the focal plane off an array of small mirrors. Each mirror either sends photons corresponding to that pixel toward the sensor, or deflects them away from the sensor. Therefore, each measurement sums over the pixels whose corresponding mirrors are directed at the sensor, and it approximates an inner product between a binary measurement vector (a row of the measurement matrix) and a vectorized version of the image. (A small simulation of this measurement model appears at the end of this question.)

(c) [2 pts] Describe a type of signal for which linear approximation is much worse than non-linear approximation. Make sure to justify your answer.

Solution: Non-linear approximation selects the largest transform coefficients, while linear approximation selects the same coefficients every time. For some types of signals linear approximation makes sense; for example, audio signals are often roughly low-pass. In contrast, for signals such as 2D images represented by 2D wavelets, the locations of edges determine where the significant wavelet coefficients are located, so non-linear approximation often works much better.
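The contrast in part (c) can also be seen numerically. The sketch below uses a 1-D DCT as the transform and a synthetic signal that is sparse in that transform, with its few large coefficients at arbitrary frequencies (a stand-in for the wavelet coefficients of an image whose edge locations are not known in advance). The signal length, the sparsity level, and the choice of k = 25 retained coefficients are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(1)

# Signal that is sparse in the DCT domain, with its few large coefficients at
# arbitrary (not just low) frequencies.
n, s = 256, 10
coeffs = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
coeffs[support] = rng.normal(0.0, 5.0, size=s)
x = idct(coeffs, norm="ortho")

k = 25
c = dct(x, norm="ortho")  # recovers the sparse coefficient vector

# Linear approximation: always keep the first k (low-frequency) coefficients.
c_lin = np.zeros(n)
c_lin[:k] = c[:k]

# Non-linear approximation: keep the k largest-magnitude coefficients, wherever they fall.
idx = np.argsort(np.abs(c))[-k:]
c_nl = np.zeros(n)
c_nl[idx] = c[idx]

print(np.linalg.norm(x - idct(c_lin, norm="ortho")))  # typically large
print(np.linalg.norm(x - idct(c_nl, norm="ortho")))   # ~0, since all s active coefficients are kept
```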
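Similarly, the single-pixel measurement model from part (b) can be sketched in a few lines. The 32x32 scene, the number of measurements m = 100, and the random mirror patterns below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 32x32 scene, flattened into a vector of n = 1024 pixel values.
n = 32 * 32
image = rng.random(n)

# Each row of A is one binary mirror pattern: 1 means the mirror is aimed at the
# sensor, 0 means it deflects the light away.  Taking m << n rows makes the
# acquisition compressive.
m = 100
A = rng.integers(0, 2, size=(m, n)).astype(float)

# Each measurement sums the pixels whose mirrors face the sensor, i.e. it is an
# inner product between a binary row of A and the vectorized image.
y = A @ image
print(y.shape)  # (100,): 100 linear measurements of a 1024-pixel scene
```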

Question 2 (Bayes classification. [6 pts]) You are given training vectors labeled into two classes, where each vector is in p = 2 dimensions. The training set is comprised of the following points, which are plotted below.

Class 1: {(11, 11), (13, 11), (8, 10), (9, 9), (7, 7), (7, 5), (15, 3)}
Class 2: {(7, 11), (15, 9), (15, 7), (13, 5), (14, 4), (9, 3), (11, 3)}

These vectors follow Gaussian distributions, where Class 1 has the following mean vector and covariance matrix,
$$\mu_1 = \begin{bmatrix} 10 \\ 8 \end{bmatrix}, \qquad \Sigma_1 = \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix},$$
and Class 2 has the following mean vector and covariance matrix,
$$\mu_2 = \begin{bmatrix} 12 \\ 6 \end{bmatrix}, \qquad \Sigma_2 = \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix}.$$

For ease of visualization, we include a scatter plot of the feature vectors, with Classes 1 and 2 denoted by crosses and squares, respectively.

(a) [1 pt] Are the two classes linearly separable? Please explain.

Solution: Two classes are linearly separable if some hyperplane (in our case, a line) can perfectly separate them. Here we have outliers in the top left and bottom right, so no line can perfectly separate the crosses from the squares, and the two classes are not linearly separable.

(b) [2 pts] In the scatter plot, indicate the decision boundary that you would get using a Bayesian classifier; note that both classes use the same diagonal covariance matrix.

Solution: There were many ways to solve this question. Because the classes are Gaussian and the covariance matrices are both diagonal with the same variance for each component, the Bayesian classifier is linear. (Some students confused the linearity of the Bayesian classifier with part (a), which referred to the specific data at hand.) Because the covariance matrices are scaled identity matrices, the classifier selects the closer class mean between $\mu_1$ and $\mu_2$. It can be shown formally (and quite a few students did so) that the decision boundary is the line of slope one passing through the midpoint between the two means, but a sketch based on the intuition described here was fine.
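As a sanity check on part (b), here is a short sketch of the nearest-mean rule that the Bayes classifier reduces to under the shared covariance $\Sigma = 10I$; equal class priors are assumed, since the exam does not state them. It also prints the line implied by the equidistance condition.

```python
import numpy as np

# Class means from the problem; the shared covariance is 10*I, and (assuming
# equal priors) the Bayes rule reduces to picking the nearer class mean.
mu1 = np.array([10.0, 8.0])
mu2 = np.array([12.0, 6.0])

def bayes_classify(x):
    """Return 1 or 2 according to which class mean is closer in Euclidean distance."""
    x = np.asarray(x, dtype=float)
    return 1 if np.sum((x - mu1) ** 2) <= np.sum((x - mu2) ** 2) else 2

# The decision boundary is the set of points equidistant from the two means:
# ||x - mu1||^2 = ||x - mu2||^2  <=>  2 (mu2 - mu1) . x = ||mu2||^2 - ||mu1||^2.
w = 2 * (mu2 - mu1)                   # normal vector of the boundary, here (4, -4)
b = np.sum(mu2**2) - np.sum(mu1**2)   # here 16, so 4x - 4y = 16, i.e. y = x - 4

print(w, b)                           # slope-one line through the midpoint (11, 7)
print(bayes_classify((6, 11)), bayes_classify((14, 3)))  # the test samples from part (d): 1, 2
```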

(c) [1 pt] In the same scatter plot, sketch the decision boundary that you would get using a nearest neighbor (NN) classifier with k = 1.

Solution: The points are arranged in such a way that the NN decision boundary is mostly identical to the Bayes boundary, except that the two outlier points (see part (a) above) carve out regions around themselves.

(d) [2 pts] Classify the test samples (6, 11) and (14, 3) using the Gaussian Bayes classifier. (Hint: you need not go into detailed derivations if you can explain the output of the classifier.)

Solution: For the first sample, (6, 11), its squared distance from $\mu_1$ is $(10-6)^2 + (8-11)^2 = 16 + 9 = 25$, and its squared distance from $\mu_2$ is $(12-6)^2 + (6-11)^2 = 36 + 25 = 61$. The first sample is closer to $\mu_1$ and so belongs to Class 1. The second test sample, (14, 3), can similarly be shown to be closer to $\mu_2$, so it belongs to Class 2.

Question 3 (Dynamic programming (DP). [3 pts]) A criminal approaches you and wants your advice about producing fake coins. The criminal can produce three types of coins. The values of these coins are $v_1 = 1$, $v_2 = 4$, and $v_3 = 11$, and the numbers of units of metal needed to produce these coins are $w_1 = 2$, $w_2 = 5$, and $w_3 = 13$, respectively. The criminal wants to produce N value in coins using as little metal as possible. Use dynamic programming to compute Ψ(N), the minimal number of units of metal needed to produce N value in coins.

Solution: DP involves two parts. First, the base case covers situations where N is small. (Some students just showed that Ψ(1) = 2, while others listed values up to N = 11.) Second, when we move beyond the base case to larger N, we minimize between (i) producing a single coin whose value is N (only feasible for $N \in \{1, 4, 11\}$); and (ii) partitioning into sub-problems of size k and N − k and plugging in the previously computed solutions Ψ(k) and Ψ(N − k), where we minimize the amount of metal over the possible values of k. Different students solved this in various ways, but these themes repeated themselves in many solutions.
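One common way to organize the recursion in Question 3 is the coin-based recurrence $\Psi(n) = \min_{i:\, v_i \le n} \{ w_i + \Psi(n - v_i) \}$ with $\Psi(0) = 0$, which is a special case of the k / (N − k) split described above (take k equal to a coin value). The bottom-up sketch below implements that recurrence; it is one of several formulations students used, not the unique intended solution.

```python
def min_metal(N, values=(1, 4, 11), weights=(2, 5, 13)):
    """psi[n]: minimal units of metal needed to produce exactly n in coin value."""
    INF = float("inf")
    psi = [0] + [INF] * N                # base case: value 0 costs no metal
    for n in range(1, N + 1):
        for v, w in zip(values, weights):
            if v <= n and psi[n - v] + w < psi[n]:
                psi[n] = psi[n - v] + w  # last coin minted has value v, costing w units of metal
    return psi[N]

print(min_metal(1), min_metal(4), min_metal(11))  # 2, 5, 13: single coins are optimal here
print(min_metal(15))                              # 15 = 11 + 4, so 13 + 5 = 18 units of metal
```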

Question 4 (Principal component analysis (PCA). [6 pts]) In a sample of 36 flea beetles, 6 different physical characteristics are measured for each beetle: head length, body length, length of one joint, length of a second joint, total weight, and body temperature. The first four are measured in micrometers, the fifth is measured in milligrams, and the sixth is measured in degrees Celsius. The outcome of a principal component analysis (PCA) and a related plot are given in two figures below. Based on these results, answer the following questions:

(a) [2 pts] The correlation matrix is used here instead of the covariance matrix. Explain why the correlation matrix is a more sensible choice than the covariance matrix for this analysis.

Solution: The covariance matrix would rely on the raw numbers, and because different units are used, the scales can be grossly different. In contrast, correlations normalize everything to [−1, +1], which aids interpretation. For example, in the second table the PC1 values associated with the beetle's size variables (0.489, 0.495, 0.437, 0.372, 0.496) are all similar in the correlation domain, implying that PC1 can be interpreted as the size of a beetle.

(b) [2 pts] How much of the variation in this dataset is explained by the first principal component? How much is explained by the first and the second components together?

Solution: In the first table, we can see that PC1 accounts for 0.674 of the variation while PC2 accounts for 0.174; together they account for 0.847.

(c) [2 pts] A plot of the principal component scores for the 36 beetles is shown in the figure below. Imagine that the horizontal and vertical axes are drawn on the plot, dividing it into 4 quadrants: UR (upper right), UL (upper left), LR (lower right), and LL (lower left). Suppose that two new beetles, Big Bob and Tiny Tim, are measured. Bob is much larger than any of the other beetles and is also slightly warmer. Tim, on the other hand, is very small compared with the other beetles, but like Bob, Tim is warmer than the others. For both Bob and Tim, which quadrant of the above graph would each lie in? How do you know?

Solution: As discussed earlier, PC1 corresponds to the size of a beetle. Similarly, in PC2 the weight on temperature is 0.824 while the weights on the other (size) variables are modest, so PC2 corresponds to the temperature. Big Bob is larger than usual (PC1 is positive) and warm (PC2 is positive), placing him in the upper right (UR) quadrant. Tiny Tim is smaller than usual (PC1 is negative) and warm (PC2 is positive), placing him in the upper left (UL) quadrant.
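For completeness, here is a short sketch of correlation-matrix PCA of the kind described in Question 4. The beetle measurements themselves are not reproduced in this transcription, so the sketch uses a random placeholder matrix X of shape (36, 6); with the real data, `explained[:2]` would correspond to the 0.674 and 0.174 proportions quoted in part (b), and the first column of `eigvecs` to the PC1 loadings quoted in part (a).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 36 beetles x 6 measurements.  (The actual measurements are
# in the exam's tables, which are not reproduced here.)
X = rng.normal(size=(36, 6))

# PCA on the correlation matrix is PCA on standardized columns, which removes
# the effect of the very different units (micrometers, milligrams, degrees C).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)           # 6 x 6 correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print(explained[0], explained[:2].sum())   # proportions of variation for PC1 and PC1+PC2

scores = Z @ eigvecs                       # principal component scores (36 x 6)
print(scores[:5, :2])                      # first few coordinates used in the score plot
```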

Question 5 (Individual projects. [4 pts]) From the individual project presentations held in class, select any two (excluding your own). For each one, briefly describe:

(i) the motivation for the project, i.e., what problem was solved and why; and
(ii) which data science techniques were used in the project.

Solution: The objective of this question was to ensure that students viewed their own project and the other students' projects as an interesting learning experience worth hearing about. Most students received 3.5-4 points here; those who described their own project lost 2 points.