A Bayesian Approach to Concept Drift

Stephen H. Bach and Marcus A. Maloof
Department of Computer Science, Georgetown University, Washington, DC 20007, USA
{bach, maloof}@cs.georgetown.edu

Abstract

To cope with concept drift, we placed a probability distribution over the location of the most-recent drift point. We used Bayesian model comparison to update this distribution from the predictions of models trained on blocks of consecutive observations and pruned potential drift points with low probability. We compare our approach to a non-probabilistic method for drift and a probabilistic method for change-point detection. In our experiments, our approach generally yielded improved accuracy and/or speed over these other methods.

1 Introduction

Consider a classification task, in which the objective is to assign labels Y to vectors of one or more attribute values X. To learn to perform this task, we use training data to model f : X → Y, the unknown mapping from attribute values to labels, or target concept, in hopes of maximizing classification accuracy. A common problem in online classification tasks is concept drift, which is when the target concept changes over time. Identifying concept drift is often difficult. If the correct label for some x is y_1 at time step t_1 and y_2 at time step t_2, does this indicate concept drift, or that the training examples are noisy?

Researchers have approached drift in a number of ways. Schlimmer and Granger [1] searched for candidate models by reweighting training examples according to how well they fit future examples. Some have maintained and modified partially learned models, e.g., [2, 3]. Many have maintained and compared base models trained on blocks of consecutive examples to identify those that are the best predictors of new examples, e.g., [4, 5, 6, 7, 8]. We focus on this approach. Such methods address directly the uncertainty about the existence and location of drift. We propose using probability theory to reason about this uncertainty.

A probabilistic model of drift offers three main benefits to the research community. First, our experimental results show that a probabilistic model can achieve new combinations of accuracy and speed on classification tasks. Second, probability theory is a well-developed theory that could offer new insights into the problem of concept drift. Third, probabilistic models can easily be combined in a principled way, and their use in the machine-learning field continues to grow [9]. Therefore, our model could readily and correctly share information with other probabilistic models or be incorporated into broader ones.

In this paper we present a probabilistic model of the number of most-recent training examples that the active concept describes. Maximum-likelihood estimation would overfit the model by concluding that each training example was generated by a different target concept. This is unhelpful, since it eliminates all generalization from past examples to future predictions. Instead, we use Bayesian model comparison [9], or BMC, to reason about the trade-offs between model complexity (i.e., the number of target concepts) and goodness of fit. We first describe BMC and its application to detecting change points. We then describe a Bayesian approach to concept drift. Finally, we show the results of an empirical comparison among our method (pruned and unpruned), BMC for change points, and Dynamic Weighted Majority [5], an ensemble method for concept drift.

2 Bayesian model comparison

BMC uses probability theory to assign degrees of belief to candidate models given observations and prior beliefs [9]. By Bayes' Theorem,

    p(M | D) = p(D | M) p(M) / p(D),

where M is the set of models under consideration and D is the set of observations. Researchers in Bayesian statistics have used BMC to look for change points in time-series data. The goal of change-point detection is to segment sequences of observations into blocks that are identically distributed and usually assumed to be independent.

2.1 Previous work on Bayesian change-point detection

Barry and Hartigan [10, 11] used product partition models as distributions over possible segmentations of time-series data. Exact inference requires O(n^3) time in the number of observations and may be accurately approximated in O(n) time using Markov sampling [10]. In an online task, approximate training and testing on n observations would require O(n^2) time, since the model must be updated after new training data. These updates would require resampling and testing for convergence. Fearnhead [12] showed how to perform direct simulation from the posterior distribution of a class of multiple-change-point models. This method requires O(n^2) time and avoids the need to use Markov sampling and to test for convergence. Again, an approximate method can be performed in approximately linear time, but the model must be regularly rebuilt in online tasks. The computational costs associated with offline methods make it difficult to apply them to online tasks.

Researchers have also looked for online methods for change-point detection. Fearnhead and Liu [13] introduced an online version of Fearnhead's simulation method [12], which uses particle filtering to quickly update the distribution over change points. Adams and MacKay [14] proposed an alternative method for online Bayesian change-point detection. We now describe it in more detail, since it will be the starting point for our own model.

2.2 A method for online Bayesian change-point detection

Adams and MacKay [14] proposed maintaining a discrete distribution over l_t, the length in time steps of the longest substring of identically distributed observations ending at time step t. This method therefore models the location of only the most recent change point, a cost-saving measure useful for many online problems. A conditional prior distribution p(l_t | l_{t-1}) is used, such that

    p(l_t | l_{t-1}) = λ^{-1}        if l_t = 0,
                       1 − λ^{-1}    if l_t = l_{t-1} + 1,      (1)
                       0             otherwise.

In principle, a more sophisticated prior could be used. The crucial aspect is that, given that a substring is identically distributed, it assigns mass to only two outcomes: the next observation is distributed identically to the observations of the substring, or it is the first of a new substring.

The algorithm is initialized at time step 0 with a single base model that is the prior distribution over observations. Initially, p(l_0 = 0) = 1. Let D_t be the observation(s) made at time step t. At each time step, the algorithm computes a new posterior distribution p(l_t | D_{1:t}) by marginalizing out l_{t-1} from

    p(l_t, l_{t-1} | D_{1:t}) = p(D_t | l_t, D_{1:t-1}) p(l_t | l_{t-1}) p(l_{t-1} | D_{1:t-1}) / p(D_t | D_{1:t-1}).      (2)

This is a straightforward summation over a discrete variable. To find p(l_t, l_{t-1} | D_{1:t}), consider the three components in the numerator. First, p(l_{t-1} | D_{1:t-1}) is the distribution that was calculated at the previous time step. Next, p(l_t | l_{t-1}) is the prior distribution.
Since only two outcomes are assigned any mass, each element of p(l_{t-1} | D_{1:t-1}) contributes mass to only two points in the posterior distribution. This keeps the algorithm linear in the size of the ensemble. Finally, p(D_t | l_t, D_{1:t-1}) = p(D_t | D_{t-l_t:t-1}). In other words, it is the predictive probability of a model trained on the observations received from time steps t − l_t to t − 1. The denominator then normalizes the distribution. Once this posterior distribution p(l_t | D_{1:t}) is calculated, each model in the ensemble is trained on the new observation. Then, a new model is initialized with the prior distribution over observations, corresponding to l_{t+1} = 0.
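To make the recursion concrete, the following is a minimal sketch of this update (ours, not the authors' implementation), with a Dirichlet-multinomial stand-in for the base models; the class, function, and parameter names are illustrative.

```python
import numpy as np


class BaseModel:
    """Illustrative base model: a Dirichlet-multinomial predictive over a
    small alphabet of discrete observations (stands in for any Bayesian
    base model that supports prediction and incremental training)."""

    def __init__(self, n_symbols):
        self.counts = np.ones(n_symbols)             # symmetric Dirichlet prior

    def predictive(self, x):
        return self.counts[x] / self.counts.sum()    # p(D_t | observations in the run)

    def update(self, x):
        self.counts[x] += 1


def bmc_step(models, weights, x, lam, n_symbols):
    """One step of the run-length recursion in Equations 1 and 2.
    models[j] is trained on the observations of candidate run j and
    weights[j] = p(run j is the current one | D_{1:t-1})."""
    fresh = BaseModel(n_symbols)                           # prior model for l_t = 0
    lik = np.array([m.predictive(x) for m in models])      # p(D_t | l_t, D_{1:t-1})
    w_grow = (1.0 - 1.0 / lam) * weights * lik             # l_t = l_{t-1} + 1
    w_new = (1.0 / lam) * fresh.predictive(x) * weights.sum()  # change point: l_t = 0
    weights = np.append(w_grow, w_new)
    weights /= weights.sum()                               # denominator of Equation 2

    for m in models:                                       # train every model on D_t
        m.update(x)
    fresh.update(x)                                        # the new run starts with D_t
    models.append(fresh)
    return models, weights


# Toy usage: a stream whose distribution changes at time step 50.
rng = np.random.default_rng(0)
models, weights = [BaseModel(3)], np.array([1.0])
stream = list(rng.integers(0, 2, size=50)) + [2] * 50
for x in stream:
    models, weights = bmc_step(models, weights, int(x), lam=20.0, n_symbols=3)
print("estimated current run length:", len(weights) - int(np.argmax(weights)))
```

Each candidate run keeps the model trained on its own observations, so every step evaluates each existing hypothesis once, adds one new hypothesis, and normalizes, which is the linear-time behavior noted above.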

3 Comparing conditional distributions for concept drift

We propose a new approach to coping with concept drift. Since the objective is to maximize classification accuracy, we want to model the conditional distribution p(Y | X) as accurately as possible. Using [14] as a starting point, we place a distribution over l_t, which now refers to the length in time steps that the currently active concept has been active. There is now an important distinction between BMC for concept drift and BMC for change points: BMC for concept drift models changes in p(Y | X), whereas BMC for change points models changes in the joint distribution p(Y, X).

We use the conditional distribution to look for drift points because we do not wish to react to changes in the marginal distribution p(X). A change point in the joint distribution p(Y, X) could correspond to a change point in p(X), a drift point in p(Y | X), or both. Reacting only to changes in p(Y | X) means that we compare models on their ability to classify unlabeled attribute values, not to generate those values. In other words, we assume that neither the sequence of attribute values X_{1:t} nor the sequence of class labels Y_{1:t} alone provides information about l_t. Therefore p(l_t | l_{t-1}, X_t) = p(l_t | l_{t-1}) and p(l_{t-1} | Y_{1:t-1}, X_{1:t}) = p(l_{t-1} | Y_{1:t-1}, X_{1:t-1}). We also assume that examples from different concepts are independent.

We use Equation 1 as the prior distribution p(l_t | l_{t-1}) [14]. Equation 2 is replaced with

    p(l_t, l_{t-1} | Y_{1:t}, X_{1:t}) = p(Y_t | l_t, Y_{1:t-1}, X_{1:t}) p(l_t | l_{t-1}) p(l_{t-1} | Y_{1:t-1}, X_{1:t-1}) / p(Y_t | Y_{1:t-1}, X_{1:t}).      (3)

To classify unlabeled attribute values X with class label Y, the predictive distribution is

    p(Y | X) = Σ_{i=1}^{t} p(Y | X, Y_{1:t}, X_{1:t}, l_t = i) p(l_t = i).      (4)

We call this method Bayesian Conditional Model Comparison (BCMC). If left unchecked, the size of its ensemble will grow linearly with the number of observations. In practice, this is far too computationally expensive for many online-learning tasks. We therefore prune the set of models during learning. Let φ be a user-specified threshold for the minimum posterior probability a model must have to remain in the ensemble. Then, if there exists some i such that p(l_t = i | D_{1:t}) < φ < p(l_t = 0 | l_{t-1}), we simply set p(l_t = i | D_{1:t}) = 0 and discard the model p(D | D_{t-i:t}). We call this modified method Pruned Bayesian Conditional Model Comparison (PBCMC).
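A minimal sketch of the BCMC update and PBCMC pruning just described (ours, not the authors' code); the discrete naive Bayes base classifier stands in for the BNB base models used in the experiments, and all names are illustrative. Setting phi to 0 recovers BCMC.

```python
import numpy as np


class NaiveBayesBase:
    """Illustrative Bayesian naive Bayes base classifier for discrete
    attributes (a stand-in for the BNB base models described later)."""

    def __init__(self, n_classes, arities):                 # arities[i] = #values of attribute i
        self.cls = np.ones(n_classes)                        # Dirichlet-smoothed class counts
        self.att = [np.ones((n_classes, a)) for a in arities]

    def predict_proba(self, x):                              # p(Y | X) via Bayes' Theorem
        logp = np.log(self.cls / self.cls.sum())
        for i, xi in enumerate(x):
            logp += np.log(self.att[i][:, xi] / self.att[i].sum(axis=1))
        p = np.exp(logp - logp.max())
        return p / p.sum()

    def update(self, y, x):
        self.cls[y] += 1
        for i, xi in enumerate(x):
            self.att[i][y, xi] += 1


def pbcmc_step(models, weights, y, x, lam, phi, n_classes, arities):
    """One BCMC update (Equation 3) followed by PBCMC pruning with threshold phi."""
    fresh = NaiveBayesBase(n_classes, arities)               # prior model for l_t = 0
    lik = np.array([m.predict_proba(x)[y] for m in models])  # p(Y_t | l_t, Y_{1:t-1}, X_{1:t})
    weights = np.append((1.0 - 1.0 / lam) * weights * lik,   # no drift: l_t = l_{t-1} + 1
                        (1.0 / lam) * fresh.predict_proba(x)[y] * weights.sum())  # drift: l_t = 0
    weights /= weights.sum()

    for m in models:                                         # train every model on the example
        m.update(y, x)
    fresh.update(y, x)
    models.append(fresh)

    keep = weights >= phi                                    # PBCMC: drop improbable drift points
    keep[-1] = True                                          # the newest model always survives
    models = [m for m, k in zip(models, keep) if k]
    weights = weights[keep]
    return models, weights / weights.sum()


def predict(models, weights, x):
    """Equation 4: a mixture of the base models' class posteriors p(Y | X)."""
    return sum(w * m.predict_proba(x) for w, m in zip(weights, models))
```

The guard that always keeps the newest model mirrors the paper's condition φ < p(l_t = 0 | l_{t-1}).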
4 Experiments

We conducted an empirical comparison using our implementations of PBCMC and BCMC. We hypothesized that looking for drift points in the conditional distribution p(Y | X) instead of change points in the joint distribution p(Y, X) would lead to higher accuracy on classification tasks. To test this, we included our implementation of the method of Adams and MacKay [14], which we refer to simply as BMC. It is identical to BCMC, except that it uses Equation 2 to compute the posterior over l_t, where D = (Y, X). We also hypothesized that PBCMC could achieve improved combinations of accuracy and speed compared to Dynamic Weighted Majority (DWM) [5], an ensemble method for concept drift that uses a heuristic weighting scheme and pruning. DWM is a top performer on the problems we considered [5].

Like the other learners, DWM maintains a dynamically sized, weighted ensemble of models trained on blocks of examples. It predicts by taking a weighted-majority vote of the models' predictions and multiplies the weights of those models that predict incorrectly by a constant β. It then rescales the weights so that the maximum weight is 1. Then, if the algorithm's global prediction was incorrect, it adds a new model to the ensemble with a weight of 1, and it removes any models with weights below a threshold θ. In the case of models that output probabilities, DWM considers a prediction incorrect if a model did not assign the most probability to the correct label.
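For comparison, here is a simplified sketch of the DWM update just summarized (our paraphrase of [5], not the original implementation); it assumes base models with predict_proba and update methods, such as the naive Bayes sketch above, and omits DWM's option of updating the ensemble only every p examples.

```python
import numpy as np


def dwm_step(experts, weights, y, x, beta, theta, new_expert):
    """One simplified DWM update; `new_expert` is a factory for a fresh base model."""
    predictions = [int(np.argmax(e.predict_proba(x))) for e in experts]

    # Weighted-majority vote for the global prediction.
    votes = {}
    for w, pred in zip(weights, predictions):
        votes[pred] = votes.get(pred, 0.0) + w
    global_pred = max(votes, key=votes.get)

    # Multiply the weights of incorrect experts by beta, then rescale so the
    # maximum weight is 1.
    weights = [w * (beta if pred != y else 1.0) for w, pred in zip(weights, predictions)]
    top = max(weights)
    weights = [w / top for w in weights]

    # After a global mistake, add a new expert with weight 1; then remove
    # experts whose weights fall below theta.
    if global_pred != y:
        experts = experts + [new_expert()]
        weights = weights + [1.0]
    kept = [(e, w) for e, w in zip(experts, weights) if w >= theta]
    experts, weights = [e for e, _ in kept], [w for _, w in kept]

    for e in experts:                      # every expert trains on the example
        e.update(y, x)
    return experts, weights
```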

4.1 Test problems

We conducted our experiments using four problems previously used in the literature to evaluate methods for concept drift.

The STAGGER concepts [1, 3] are three target concepts in a binary classification task presented over 120 time steps. Attributes and their possible values are shape ∈ {triangle, circle, rectangle}, color ∈ {red, green, blue}, and size ∈ {small, medium, large}. For the first 40 time steps, the target concept is color = red ∧ size = small. For the next 40 time steps, the target concept is color = green ∨ shape = circle. Finally, for the last 40 time steps, the target concept is size = medium ∨ size = large. A number of researchers have used this problem to evaluate methods for concept drift [4, 5, 3, 1]. Per the problem's usual formulation, we evaluated each learner by presenting it with a single, random example at each time step and then testing it on a set of 100 random examples, resampled after each time step. We conducted 50 trials.

The SEA concepts [8] are four target concepts in a binary classification task, presented over 50,000 time steps. The target concept changes every 12,500 time steps, and associated with each concept is a single, randomly generated test set of 2,500 examples. At each time step, a learner is presented with a randomly generated example, which has a 10% chance of being labeled as the wrong class. Every 100 time steps, the learner is tested on the active concept's test set. Each example consists of numeric attributes x_i ∈ [0, 10], for i = 1, ..., 3. The target concepts are hyperplanes, such that y = + if x_1 + x_2 ≤ θ, where θ ∈ {7, 8, 9, 9.5} for each of the four target concepts, respectively; otherwise, y = −. Note that x_3 is an irrelevant attribute. Several researchers have used a shifting hyperplane to evaluate learners for concept drift [5, 6, 7, 2, 8]. We conducted 10 trials. In this experiment, µ_0 = 5.

The calendar-apprentice (CAP) data sets [15, 16] comprise a personal-scheduling task. Using a subset of 34 symbolic attributes, the task is to predict a user's preference for a meeting's location, duration, start time, and day of week. There are 12 attributes for location, 11 for duration, 15 for start time, and 16 for day of week. Each learner was tested on the 1,685 examples for User 1. At each time step, the learner was presented the next example without its label. After classifying it, it was then told the correct label so it could learn.

The electricity-prediction data set consists of 45,312 examples collected at 30-minute intervals between 7 May 1996 and 5 December 1998 [17]. The task is to predict whether the price of electricity will go up or down based on five numeric attributes: the day of the week, the 30-minute period of the day, the demand for electricity in New South Wales, the demand in Victoria, and the amount of electricity to be transferred between the two. About 39% of the examples have unknown values for either the demand in Victoria or the transfer amount. At each time step, the learner classified the next example in temporal order before being given the correct label and using it to learn. A fixed value of µ_0 was likewise used in this experiment.
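As a concrete reference for the two synthetic problems, here is a minimal sketch of example generators matching the descriptions above; the integer encodings of the symbolic STAGGER attributes are our own choice.

```python
import numpy as np


def stagger_example(t, rng):
    """One random STAGGER example at time step t (0-based; the target
    concept changes at steps 40 and 80)."""
    shape = rng.integers(3)    # 0: triangle, 1: circle, 2: rectangle
    color = rng.integers(3)    # 0: red, 1: green, 2: blue
    size = rng.integers(3)     # 0: small, 1: medium, 2: large
    if t < 40:
        label = int(color == 0 and size == 0)     # color = red and size = small
    elif t < 80:
        label = int(color == 1 or shape == 1)     # color = green or shape = circle
    else:
        label = int(size == 1 or size == 2)       # size = medium or size = large
    return (shape, color, size), label


def sea_example(t, rng, noise=0.10):
    """One random SEA example at time step t (0-based; the target concept
    changes every 12,500 steps; thresholds follow the order given above)."""
    theta = (7.0, 8.0, 9.0, 9.5)[min(t // 12500, 3)]
    x = rng.uniform(0.0, 10.0, size=3)            # x3 is irrelevant to the label
    label = int(x[0] + x[1] <= theta)             # y = + when x1 + x2 <= theta
    if rng.random() < noise:                      # 10% chance of the wrong class
        label = 1 - label
    return x, label
```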
4.2 Experimental design

We tested the learning methods on the four problems described. For STAGGER and SEA, we measured accuracy on the test set, then computed average accuracy and 95% confidence intervals at each time step. We also computed the average normalized area under the performance curves (AUC) with 95% confidence intervals. We used the trapezoid rule on adjacent pairs of accuracies and normalized by dividing by the total area of the region. We present both the AUC under the entire curve and the AUC after the first drift point, to show both a learner's overall performance and its performance after drift occurs. For CAP and electricity prediction, we measured accuracy on the unlabeled observations.
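As an illustration of this metric, a minimal sketch of the normalized AUC computation (ours), assuming accuracies recorded at evenly spaced evaluation points:

```python
import numpy as np


def normalized_auc(accuracies, start=0):
    """Normalized area under a performance curve: the trapezoid rule on
    adjacent pairs of accuracies, divided by the total area of the region.
    Assumes accuracies in [0, 1] at evenly spaced evaluation points (at
    least two of them); start > 0 gives the AUC after a drift point."""
    acc = np.asarray(accuracies[start:], dtype=float)
    area = 0.5 * (acc[1:] + acc[:-1]).sum()       # trapezoid rule with unit spacing
    return area / (len(acc) - 1)                  # normalize by the region's area
```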

Table 1: Results for (a) the STAGGER concepts and (b) the SEA concepts.

(a) STAGGER concepts

    Learner and Parameters           AUC (overall)    AUC (after drift)
    BNB, on each concept             0.912 ±          ± 0.007
    PBCMC, λ = 20, φ = 10^-4         ±                ± 0.007
    BCMC, λ =                        ±                ± 0.007
    BMC, λ =                         ±                ± 0.008
    DWM, β = 0.5, θ =                ±                ± 0.007
    BNB, on all examples             0.647 ±          ± 0.011

(b) SEA concepts

    Learner and Parameters           AUC (overall)    AUC (after drift)
    BNB, on each concept             0.974 ±          ± 0.002
    DWM, β = 0.9, θ =                ±                ± 0.001
    BCMC, λ = 10,                    ±                ± 0.002
    PBCMC, λ = 10,000, φ = 10^-4     ±                ± 0.003
    BMC, λ =                         ±                ± 0.003
    BNB, on all examples             0.910 ±          ± 0.002

All the learning methods used a model we call Bayesian Naive Bayes, or BNB, as their base models. BNB makes the conditionally independent factor assumption (a.k.a. the naive Bayes assumption) that the joint distribution p(Y, X) factors into p(Y) ∏_{i=1}^{n} p(X_i | Y) [9]. It calculates values for p(Y | X) as needed using Bayes' Theorem. It takes the Bayesian approach to probabilities (hence the additional "Bayes" in the name), meaning that it places distributions over the parameters that govern the distributions p(Y) and p(X | Y) into which p(Y, X) factors. In our experiments, BNB predicted by marginalizing out the latent parameter variables to compute marginal likelihoods.

Note that we use BNB, a generative model over p(Y, X), even though we said that we wish to model p(Y | X) as accurately as possible. This is to ensure a fair comparison with BMC, which needs p(Y, X). We are more interested in the effects of looking for changes in each distribution than in which is a better model for the active concept.

In our experiments, BNB placed Dirichlet distributions [9] over the parameters θ of the multinomial distributions p(Y) and p(X_i | Y) when X_i was a discrete attribute. All Dirichlet priors assigned equal density to all valid values of θ. BNB placed Normal-Gamma distributions [9] over the parameters µ and λ of normal distributions p(X_i | Y) when X_i was a continuous attribute:

    p(µ, λ) = N(µ | µ_0, (βλ)^{-1}) Gam(λ | a, b).

The predictive distribution is then a Student's t-distribution with mean µ and precision λ. In all of our experiments, β = 2 and a = b = 1. The value of µ_0 is specified for each experiment with continuous attributes. We also tested BNB as a control, to show the effects of not attempting to cope with drift, and BNB trained using only examples from the active concept (when such information was available), to show possible accuracy given perfect information about drift.

Parameter selection is difficult when evaluating methods for concept drift. Train-test-and-validate methods such as k-fold cross-validation are not appropriate because the observations are ordered and not assumed to be identically distributed. We therefore tested each learner on each problem using each of a set of values for each parameter. Due to limited space, we present results for each learning method using the best parameter settings we found. We make no claim that these parameters are optimal, but they are representative of the overall trends we observed. We performed this parameter search for all the learning methods. The parameters we tested were λ ∈ {10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000}, φ ∈ {10^-2, 10^-3, 10^-4}, β ∈ {0.25, 0.5, 0.75, 0.9}, and θ ∈ {10^-2, 10^-3, 10^-4, 0}.
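To make the continuous-attribute case concrete, here is a minimal sketch (ours, using the standard conjugate Normal-Gamma update rather than the authors' code) of the posterior for one attribute and one class, together with its Student's t predictive density; for discrete attributes the corresponding Dirichlet predictive reduces to smoothed counts, as in the naive Bayes sketch earlier. Class and method names are illustrative.

```python
import math


class NormalGammaAttribute:
    """One continuous attribute for one class: a Normal-Gamma posterior over
    the mean and precision of a normal distribution,
    p(mu, lambda) = N(mu | mu0, (beta*lambda)^-1) Gam(lambda | a, b),
    whose marginal predictive is a Student's t density. Hyperparameters
    follow the text: beta = 2, a = b = 1, mu0 set per experiment."""

    def __init__(self, mu0, beta=2.0, a=1.0, b=1.0):
        self.mu0, self.beta0, self.a0, self.b0 = mu0, beta, a, b
        self.n, self.total, self.total_sq = 0, 0.0, 0.0      # sufficient statistics

    def update(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def _posterior(self):
        n = self.n
        if n == 0:
            return self.mu0, self.beta0, self.a0, self.b0
        mean = self.total / n
        ss = max(self.total_sq - n * mean * mean, 0.0)        # sum of squared deviations
        beta_n = self.beta0 + n
        mu_n = (self.beta0 * self.mu0 + n * mean) / beta_n
        a_n = self.a0 + 0.5 * n
        b_n = self.b0 + 0.5 * ss + 0.5 * self.beta0 * n * (mean - self.mu0) ** 2 / beta_n
        return mu_n, beta_n, a_n, b_n

    def predictive(self, x):
        """Student's t density at x with 2 * a_n degrees of freedom."""
        mu_n, beta_n, a_n, b_n = self._posterior()
        dof = 2.0 * a_n
        scale2 = b_n * (beta_n + 1.0) / (a_n * beta_n)        # squared scale of the t
        z = (x - mu_n) ** 2 / (dof * scale2)
        log_norm = (math.lgamma(0.5 * (dof + 1.0)) - math.lgamma(0.5 * dof)
                    - 0.5 * math.log(dof * math.pi * scale2))
        return math.exp(log_norm - 0.5 * (dof + 1.0) * math.log1p(z))
```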

Table 2: Accuracy on the CAP and electricity data sets.

                  PBCMC                     BCMC        BMC      DWM                    BNB
                  λ = 10,000, φ = 10^-4     λ = 5,000   λ = 10   β = 0.75, θ = 10^-4
    Location
    Duration
    Start Time
    Day of Week
    Average
                  λ = 10, φ = 10^-2         λ = 10      λ = 10   β = 0.25, θ = 10^-3
    Electricity

4.3 Results and analysis

Table 1 shows the top results for the STAGGER and SEA concepts. On the STAGGER concepts, PBCMC and BCMC performed almost identically and have a higher mean AUC than BMC, but their 95% confidence intervals overlap. PBCMC and BCMC outperformed DWM. On the SEA concepts, DWM was the top performer, matching the accuracy of BNB trained on each concept and outperforming all the other learning methods. BCMC was next, followed by PBCMC, then BMC, and then BNB.

Table 2 shows the top results for the CAP and electricity data sets. DWM performed the best on the location and duration data sets, while BCMC performed best on the start-time and day-of-week data sets. PBCMC matched the accuracy of BCMC on the day-of-week and duration data sets and came close to it on the others. DWM had the highest mean accuracy over all four tasks, followed by PBCMC and BCMC, then BMC, and finally BNB. BCMC performed the best on the electricity data set, closely followed by PBCMC.

The first conclusion is clear: looking for changes in the conditional distribution p(Y | X) led to better accuracy than looking for changes in the joint distribution p(Y, X). With the close exception of the duration problem in the CAP data sets, PBCMC and BCMC outperformed BMC, sometimes dramatically so. Less clear are the relative merits of PBCMC and DWM. We now analyze these learners to better address this question.

4.3.1 Reactivity versus stability

The four test problems can be partitioned into two subsets: those on which PBCMC was generally more accurate (STAGGER and electricity) and those on which DWM was (SEA and CAP). We can obtain further insight into what separates these two subsets by noting that both PBCMC and DWM can be said to have strategies, which are determined by their parameters. For PBCMC, higher values of λ mean that it will assign less probability initially to new models. For DWM, higher values of β mean that it will penalize models less for making mistakes. For both, lower values of φ and θ, respectively, mean that they are slower to completely remove poorly performing models from consideration. We can thus interpret these parameters to describe how reactive or stable the learners are, i.e., the degree to which new observations can alter their hypotheses [4].

The two subsets are also partitioned by the strategy that was superior for the problems in each. For both PBCMC and DWM, some of the most reactive parameterizations we tested were optimal on STAGGER and electricity, but some of the most stable were optimal on SEA and CAP. Further, we observed generally stratified results across parameterizations. For each problem, almost all of the parameterizations of the top learner were more accurate than almost all of the parameterizations of the other. This indicates that PBCMC was generally better for the concepts which favor reactivity, whereas DWM was generally better for the concepts which favor stability.

4.3.2 Closing the performance gaps

We now consider why these gaps in performance exist and how they might be closed. Figure 1 shows the average accuracies of PBCMC and DWM at each time step on the STAGGER and SEA concepts. These are for the experiments reported in Table 1, so the parameters, numbers of trials, etc. are the same.
We present 95% confidence intervals at selected time steps for both.

Figure 1: Average accuracy on (a) the STAGGER concepts and (b) the SEA concepts. See text for details.

Figure 1 shows that the better-performing learner on each problem was faster to react to concept drift. This shows that DWM did not perform better on SEA simply by remaining stable whether or not the concept was. On the SEA concepts, PBCMC did perform best with the most stable parameterization we tried, but its main problem was that it was not reactive enough when drift occurred.

We first consider whether the problem is one of parameter selection. Perhaps we can achieve better performance by using a more reactive parameterization of DWM on certain problems and/or a more stable parameterization of PBCMC on other problems. Our experimental results cast doubt on this proposition. For the problems on which PBCMC was superior, DWM's best results were not obtained using the most reactive parameterization. In other words, simply using an even more reactive parameterization of DWM did not improve performance on these problems. Further, on the duration problem in the CAP data sets, PBCMC also achieved the reported accuracy using λ = 5000 and φ = 10^-2, and on the location problem it achieved negligibly better accuracy using λ = 5000 and φ = 10^-3. Therefore, simply using an even more stable parameterization of PBCMC did not improve performance on these problems either.

BCMC, which is just PBCMC with φ = 0, did outperform PBCMC on SEA. It reacted more quickly than PBCMC did, but not as quickly as DWM did, and at a much greater computational cost, since it had to maintain every model in order to have the one(s) which would eventually gain weight relative to the other models. BCMC also was not a significant improvement over PBCMC on the location and duration problems.

We therefore theorize that the primary reason for the differences in performance between PBCMC and DWM is their approaches to updating their ensembles, which determine how they react to drift. PBCMC favors reactivity by adding a new model at every time step and decaying the weights of all models by the degree to which they are incorrect. DWM favors stability by only adding a new model after incorrect overall predictions and only decaying the weights of incorrect models, and then only by a constant factor. This is supported by the results on problems favoring reactive parameterizations compared with the results on problems favoring stable parameterizations. Further, the difficulty of closing the performance gaps with better parameter selection suggests that there is a range of reactivity or stability each favors. When parameterized beyond this range, the performance of each learner degrades, or at least plateaus.

To further support this theory, we consider trends in ensemble sizes. Figure 2 shows the average number of models in the ensembles of PBCMC and DWM at each time step on the STAGGER and SEA concepts. These are again for the experiments reported in Table 1, and again we present 95% confidence intervals at selected time steps for both. The figure shows that the trends in ensemble sizes were roughly interchanged between the two learners on the two problems. On both problems, one learner stayed within a relatively small range of ensemble sizes, whereas the other continued to expand its ensemble while the concept was stable, only significantly pruning soon after drift.
On STAGGER, PBCMC expanded its ensemble size far more, whereas DWM did so on SEA. This agrees with our expectations for the synthetic concepts. STAGGER contains no noise, whereas SEA does, which complements the designs of the two learners.

Figure 2: Average numbers of models on (a) the STAGGER concepts and (b) the SEA concepts. See text for details.

When noise is more likely, DWM will update its ensemble more than when it is not as likely. However, when noise is more likely, PBCMC will usually have difficulty preserving high weights for models which are actually useful. Conversely, PBCMC regularly updates its ensemble, and DWM will have less difficulty maintaining high weights on good models because it only decays weights by a constant factor. Therefore, it seems that each learner reaches the boundary of its favored range of reactivity or stability when further changes in that direction cause it to either be so reactive that it often assigns relatively high probability of drift to many time steps for which there was no drift, or so stable that it cannot react to actual drift. On STAGGER, PBCMC matched the performance of BNB on the first target concept (not shown), whereas DWM made more mistakes as it reacted to erroneously inferred drift. On SEA, PBCMC needs to be parameterized to be so stable that it cannot react quickly to drift.

5 Conclusion and Future Work

In this paper we presented a Bayesian approach to coping with concept drift. Empirical evaluations supported our method. We showed that looking for changes in the conditional distribution p(Y | X) led to better accuracy than looking for changes in the joint distribution p(Y, X). We also showed that our Bayesian approach is competitive with one of the top ensemble methods for concept drift, DWM, sometimes beating it and sometimes losing to it. Finally, we explored why each method sometimes outperforms the other. We showed that PBCMC and DWM each appear to favor a different range of reactivity or stability.

Directions for future work include integrating the advantages of both PBCMC and DWM into a single learner. Related to this task is a better characterization of their relative advantages and the relationships among them, their favored ranges of reactivity or stability, and the problems to which they are applied. It is also important to note that the more constrained ensemble sizes discussed above correspond to faster classification speeds. Future work could explore how to balance this desideratum with the desire for better accuracy. Finally, another direction is to integrate a Bayesian approach with other probabilistic models. With a useful probabilistic model for concept drift, such as ours, one could potentially incorporate existing probabilistic domain knowledge to guide the search for drift points or build broader models that use beliefs about drift to guide decision making.

Acknowledgments

The authors wish to thank the anonymous reviewers for their constructive feedback. The authors also wish to thank Lise Getoor and the Department of Computer Science at the University of Maryland, College Park. This work was supported by the Georgetown University Undergraduate Research Opportunities Program.

References

[1] J. C. Schlimmer and R. H. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the Fifth National Conference on Artificial Intelligence, Menlo Park, CA, 1986. AAAI Press.
[2] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, 2001. ACM Press.
[3] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69-101, 1996.
[4] S. H. Bach and M. A. Maloof. Paired learners for concept drift. In Proceedings of the Eighth IEEE International Conference on Data Mining, pages 23-32, Los Alamitos, CA, 2008. IEEE Press.
[5] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8, December 2007.
[6] J. Z. Kolter and M. A. Maloof. Using additive expert ensembles to cope with concept drift. In Proceedings of the Twenty-Second International Conference on Machine Learning, New York, NY, 2005. ACM Press.
[7] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, 2003. ACM Press.
[8] W. N. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, 2001. ACM Press.
[9] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, Berlin-Heidelberg, 2006.
[10] D. Barry and J. A. Hartigan. A Bayesian analysis for change point problems. Journal of the American Statistical Association, 88(421), 1993.
[11] D. Barry and J. A. Hartigan. Product partition models for change point problems. The Annals of Statistics, 20(1), 1992.
[12] P. Fearnhead. Exact and efficient Bayesian inference for multiple changepoint problems. Statistics and Computing, 16(2), 2006.
[13] P. Fearnhead and Z. Liu. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4), September 2007.
[14] R. P. Adams and D. J. C. MacKay. Bayesian online changepoint detection. Technical report, University of Cambridge, 2007.
[15] A. Blum. Empirical support for winnow and weighted-majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26:5-23, 1997.
[16] T. M. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski. Experience with a learning personal assistant. Communications of the ACM, 37(7):80-91, July 1994.
[17] M. Harries, C. Sammut, and K. Horn. Extracting hidden context. Machine Learning, 32(2), 1998.


More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Machine Learning Overview

Machine Learning Overview Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier

Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier Kaizhu Huang, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Machine Learning. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 8 May 2012

Machine Learning. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 8 May 2012 Machine Learning Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421 Introduction to Artificial Intelligence 8 May 2012 g 1 Many slides courtesy of Dan Klein, Stuart Russell, or Andrew

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

CS 188: Artificial Intelligence. Machine Learning

CS 188: Artificial Intelligence. Machine Learning CS 188: Artificial Intelligence Review of Machine Learning (ML) DISCLAIMER: It is insufficient to simply study these slides, they are merely meant as a quick refresher of the high-level ideas covered.

More information

MIRA, SVM, k-nn. Lirong Xia

MIRA, SVM, k-nn. Lirong Xia MIRA, SVM, k-nn Lirong Xia Linear Classifiers (perceptrons) Inputs are feature values Each feature has a weight Sum is the activation activation w If the activation is: Positive: output +1 Negative, output

More information