AN EM ALGORITHM FOR HAWKES PROCESS
Peter F. Halpin
New York University

December 17, 2012

Correspondence should be sent to Dr. Peter F. Halpin, 246 Greene Street, Office 316E, New York, NY.
Abstract

This manuscript addresses the EM algorithm developed in Halpin & De Boeck (in press). The runtime of the algorithm grows quadratically in the number of observations, making its application to large data sets impractical. A strategy for improving efficiency is introduced, and this results in linear growth for many applications. The performance of the modified algorithm is assessed using data simulation.

Key words: Hawkes process; EM algorithm; maximum likelihood; runtime
Introduction

Halpin & De Boeck (in press) considered the time series analysis of bivariate event data in the context of dyadic interaction. They proposed the use of point processes, and in particular Hawkes process, as a way to capture the temporal dependence between the actions of two individuals. Estimation was based on the so-called branching structure representation of Hawkes process, which they showed to be amenable to estimation via the EM algorithm (see also Veen & Schoenberg, 2008). Unfortunately, the runtime of the algorithm grows quadratically in the number of observations, making its application to large data sets impractical.

The present paper provides a modification of the original algorithm that substantially improves its runtime. The modification reduces the number of computations in the algorithm by tolerating a specified degree of rounding error, and this results in linear growth for many applications. The next section outlines Hawkes process in sufficient detail for this paper to be self-contained and gives an intuitive description of the problem to be addressed. The subsequent section presents the modification to the EM algorithm and illustrates some cases where this yields linear growth. The final section uses data simulation to arrive at a magnitude of rounding error that has a negligible effect on parameter recovery.

Hawkes Process

Under mild conditions, a point process can be uniquely defined in terms of its conditional intensity function (CIF). The main reason for specifying a point process in terms of its CIF is that this leads directly to an expression for its likelihood. A general form for the CIF is

    λ(t) = lim_{Δ→0} E(M{(t, t + Δ)} | H_t) / Δ    (1)

where M{(a, b)} is a random counting measure representing the number of events (i.e., isolated points) falling in the interval (a, b), E(M{(a, b)}) is its expected value, and H_t is the σ-algebra generated by the time points t_k, k ∈ N, occurring before time t ∈ R+ (see Daley & Vere-Jones, 2003). In this paper it is assumed that the probability of multiple events occurring simultaneously is negligible, in which case M is said to be orderly. Then for fixed t and sufficiently small values of Δ, λ(t)Δ approximates the Bernoulli probability of an event occurring in the interval (t, t + Δ), conditional on all of the events happening before time t. In applications, this means we are concerned with how the probability of an event changes over continuous time as a function of previous events.

Point processes extend immediately to the multivariate case. M{(a, b)} is then vector-valued and each univariate margin gives the number of a different type of event occurring in the time period (a, b). Although Halpin and De Boeck (in press) considered a bivariate model, this paper focusses on the univariate case since the problem to be addressed can be most simply explained in that situation.

The CIF of Hawkes process can be specified as a linear causal filter:

    λ(t) = μ + ∫_0^t φ(t − s) dM(s).    (2)

The interpretation of equation (2) is unpacked in the following three points.

1. μ > 0 is a baseline, which can be a function of time but is here treated as a constant.

2. φ(u) is a response function that governs how the process depends on its past. Hawkes process requires the following three assumptions: φ(u) ≥ 0 for u ≥ 0; φ(u) = 0 for u < 0; and ∫_0^∞ φ(u) du ≤ 1. Together these assumptions imply that φ can be written as

       φ(u) = α f(u; ξ)    (3)

   where 0 ≤ α ≤ 1 and f(u; ξ) is a probability density function on R+ with parameter ξ. Equation (3) presents a convenient method for parametrizing φ, with some common choices for f(u; ξ) being the exponential (e.g., Ogata, 1988; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005), the two-parameter gamma (Halpin & De Boeck, in press), and the power law distribution (Barabási, 2005; Crane & Sornette, 2008). Under this parameterization, α is referred to as the intensity parameter and f(u; ξ) as the response kernel.

3. In the case that M is orderly, dM(s) is representable as a series of right-shifted Dirac delta functions, and the integral reduces to a sum over all events in [0, t], yielding

       ∫_0^t φ(t − s) dM(s) = Σ_{t_j < t} φ(t − t_j).    (4)

Thus each new time point is associated with a response function describing how that time point affects the future of the process. Under the assumptions of Hawkes process, each new time point increases the probability of further events occurring in the immediate future (i.e., φ(u) is non-negative). The summation shows that the effect of multiple time points on the probability of further events is cumulative. For these reasons, Hawkes process is often referred to as self-exciting: the occurrence of one event increases the probability of further events, whose occurrence in turn increases the probability of even more events. In terms of applications, this means that Hawkes process is appropriate for modelling clustering, which occurs when periods of high event frequency are separated by periods of relative inactivity.

As noted, the CIF leads directly to an expression for the log-likelihood (see Daley & Vere-Jones, 2003):

    ℓ(θ | X) = Σ_k ln(λ(t_k)) − ∫_0^T λ(s) ds    (5)

where [0, T] is the observation period, X = {t_1, t_2, ...} denotes the observed event times, and θ contains the parameters of the model.
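To make equations (2) through (5) concrete, the following minimal sketch evaluates the CIF and the log-likelihood for an exponential response kernel, f(u; ξ) = ξe^{−ξu}, so that φ(u) = αξe^{−ξu}. This is an illustration only, not the implementation used in the paper (which employed a gamma kernel and C code); the function names are hypothetical.

```python
import numpy as np

def hawkes_cif(t, events, mu, alpha, xi):
    """Equations (2)-(4): lambda(t) = mu + sum over t_j < t of
    alpha * xi * exp(-xi * (t - t_j)), for an exponential kernel."""
    past = events[events < t]
    return mu + alpha * xi * np.exp(-xi * (t - past)).sum()

def hawkes_loglik(events, T, mu, alpha, xi):
    """Incomplete-data log-likelihood, equation (5).

    Evaluating lambda(t_k) at every event requires n(n - 1)/2 kernel
    evaluations in total, which is the quadratic growth discussed in
    the text."""
    ll = 0.0
    for k in range(len(events)):
        ll += np.log(hawkes_cif(events[k], events, mu, alpha, xi))
    # integral of lambda over [0, T]: the baseline contributes mu * T and
    # each response function contributes its kernel mass inside the window
    ll -= mu * T + alpha * np.sum(1.0 - np.exp(-xi * (T - events)))
    return ll

# toy usage on a handful of event times
events = np.array([0.7, 1.1, 1.3, 4.2, 4.25, 9.0])
print(hawkes_loglik(events, T=10.0, mu=0.3, alpha=0.4, xi=2.0))
```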
Substitution of equations (2) through (4) into equation (5) shows that the log-likelihood of Hawkes process contains the logarithm of a weighted sum of density functions. A similar situation occurs in finite mixture modelling (e.g., McLachlan & Peel, 2000) and nonlinear regression (e.g., Seber & Wild, 2003), where it is known to lead to numerical optimization problems related to ill-conditioning of, and multiple roots in, the likelihood function. In the present case the problem is aggravated by the fact that the number of densities appearing in the likelihood increases with the number of observations, as shown in equation (4). It is important to note that the number of model parameters does not grow with the number of time points; the densities are simply right-shifted. In general, if there are a total of n observed events, then there are a total of n(n − 1)/2 response functions appearing in the log-likelihood of a univariate Hawkes process, not including the duplicated response functions appearing in the integral. This is the source of the quadratic growth of the optimization problem, which is the issue to be dealt with in this paper.

The quadratic growth is especially problematic because the EM algorithm proposed by Halpin and De Boeck (in press) requires the use of multiple starting values. This means that even moderately sized data sets cannot be estimated in a reasonable amount of time. For example, an actual runtime of over 24 hours was recorded for a problem with N ≈ 1500 events and 50 starting values (implemented in the C language on a machine with a 2 GHz processor). Because one of the most exciting potential applications of Hawkes process is to big data collected via computer-mediated communication (e.g., databases, Twitter), it is important to have an estimation approach that is feasible for large samples. The following section outlines how that can be accomplished.

Reducing Runtime by Introducing Rounding Error

This section outlines the original EM algorithm suggested by Halpin and De Boeck (in press) and then considers how to reduce its runtime. The algorithm is based on an alternative representation of Hawkes process, which is referred to as its branching structure.
In terms of the EM algorithm, the branching structure provides the complete data representation of the model, whereas the causal filter in equation (2) is the incomplete data representation. Taking this approach, the logarithm of the sum of densities in equation (5) is replaced by the sum of their logarithms, which results in better conditioning of the numerical optimization problem and was shown to perform satisfactorily with relatively small data sets (N ≈ 400). Although the considerations of this section could also be made for equation (5), the focus is on the EM approach.

The branching structure representation of Hawkes process is in terms of a cluster Poisson process. It was first proposed by Hawkes and Oakes (1974), who proved it to be equivalent to the representation given in the foregoing section. Their argument was very technical and it served to establish the existence and uniqueness of the process. The branching structure has also found more intuitive applications. For example, in ecology it is used to describe the growth of wildlife populations in terms of subsequent generations of offspring due to each immigrant (e.g., Rasmussen, 2011). In the context of disease control, it is interpreted as the number of people contaminated by each subsequent carrier (e.g., Daley & Vere-Jones, 2003). Veen and Schoenberg (2008) were the first to consider the branching structure as a strategy for obtaining maximum likelihood estimates (MLEs) of Hawkes process.

For the present purpose, the effect of the branching structure is to decompose Hawkes process into n independent Poisson processes whose rate functions are given by the response functions in equation (3). These processes govern the number of offspring of each event. There is also an additional Poisson process governing the number of immigrant events; this process has a rate function given by the baseline parameter μ. Importantly, each event t_k is assumed to be due to one and only one of these independent Poisson processes: either one centered at its parent, t_j, with t_j < t_k, or the baseline process. Consequently, if we knew which process each event belonged to, estimation of Hawkes process would reduce to that for a collection of independent Poisson processes.
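The cluster interpretation also yields a transparent simulation scheme: immigrants arrive as a homogeneous Poisson process with rate μ, and each event independently produces a Poisson(α) number of offspring at lags drawn from the response kernel. The paper's own simulations use the inverse method instead; the sketch below, which assumes an exponential kernel, uses hypothetical function names, and ignores offspring that would fall beyond T, is offered only to illustrate the decomposition.

```python
import numpy as np

def simulate_hawkes_branching(mu, alpha, xi, T, rng):
    """Simulate a univariate Hawkes process on [0, T] via its branching
    structure: a Poisson(mu * T) number of immigrants, each event then
    spawning Poisson(alpha) offspring at exponential(xi) lags."""
    n_immigrants = rng.poisson(mu * T)
    generation = list(rng.uniform(0.0, T, size=n_immigrants))
    events = []
    while generation:                        # breadth-first over generations
        events.extend(generation)
        offspring = []
        for parent in generation:
            n_children = rng.poisson(alpha)  # E(offspring) = integral of phi = alpha
            lags = rng.exponential(1.0 / xi, size=n_children)
            offspring.extend(t for t in parent + lags if t < T)
        generation = offspring
    return np.sort(np.array(events))

rng = np.random.default_rng(1)
events = simulate_hawkes_branching(mu=0.2, alpha=0.4, xi=1.0, T=500.0, rng=rng)
print(len(events))  # on average roughly mu * T / (1 - alpha) events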
It is therefore natural to introduce a missing variable that describes the specific process to which each event t_k belongs, and to proceed by means of the EM algorithm. As with other applications of the EM algorithm, the missing data need not correspond to the hypothesized data generating process; they can be treated merely as a tool for obtaining MLEs.

The following notation is employed to set up the algorithm. Let Z = (Z_1, Z_2, ..., Z_n) denote the missing data. If an event t_k is an offspring of event t_j, t_j < t_k, this is denoted by setting Z_k = j. If an event t_k is an immigrant then Z_k = 0. Also let φ_j(u) denote the response functions governing each Poisson process, where it is understood that φ_0(u) = μ. For j > 0, these response functions are identical to those introduced in equation (3) above, except the subscript serves to make explicit the centering event t_j. Letting ℓ(θ | X, Z) denote the complete data log-likelihood, Halpin and De Boeck (in press) showed that

    Q(θ) = E_{Z|X,θ}[ℓ(θ | X, Z)]
         = Σ_{j=0}^{n} ( Σ_{k>j} ln(φ_j(t_k − t_j)) Prob(Z_k = j | X, θ) − ∫_0^{T − t_j} φ_j(u) du )    (6)

where

    Prob(Z_k = j | X, θ) = φ_j(t_k − t_j) / Σ_{r<k} φ_r(t_k − t_r).    (7)

Equations (6) and (7) provide the necessary components of an EM algorithm for Hawkes process. Equation (7) is readily computed on the E step. On the M step these probabilities are treated as fixed and entered into equation (6). Using this approach, Halpin and De Boeck (in press) provided closed form solutions for the baseline parameter μ and the intensity parameter α. However, in order to obtain the parameters of the response kernel, it is necessary to numerically optimize the Q function. This is the computationally expensive part of the algorithm.
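As an illustration of how equations (6) and (7) translate into an E step and an M step, here is a sketch for the exponential kernel. The closed-form updates for μ and α shown here are a standard simplification that ignores edge effects near T (the paper gives the exact expressions), and for the exponential kernel the rate ξ also happens to have a closed-form update; the two-parameter gamma kernel used in the paper requires numerical optimization of Q at this point. None of this is the authors' code.

```python
import numpy as np

def e_step(events, mu, alpha, xi):
    """Equation (7): p[k, 0] is Prob(t_k is an immigrant) and p[k, j + 1]
    is Prob(t_k is an offspring of t_j), for j < k."""
    n = len(events)
    p = np.zeros((n, n + 1))
    for k in range(n):
        p[k, 0] = mu                                     # phi_0(u) = mu
        lags = events[k] - events[:k]
        p[k, 1:k + 1] = alpha * xi * np.exp(-xi * lags)  # phi_j(t_k - t_j)
        p[k] /= p[k].sum()                               # denominator of (7)
    return p

def m_step(events, T, p):
    """M step with edge effects near T ignored (illustrative only)."""
    n = len(events)
    mu = p[:, 0].sum() / T                    # expected immigrants per unit time
    w_sum, wl_sum = 0.0, 0.0
    for k in range(n):
        lags = events[k] - events[:k]
        w_sum += p[k, 1:k + 1].sum()          # expected total offspring count
        wl_sum += (p[k, 1:k + 1] * lags).sum()  # expected total offspring lag
    alpha = w_sum / n                         # expected offspring per event
    xi = w_sum / wl_sum                       # weighted exponential MLE of rate
    return mu, alpha, xi
```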
Since the sum over k > j is the source of the quadratic growth of the Q function, let us first consider how it can be reduced. Recall that for j ≥ 1, φ_j(u) = α f(u; ξ) is just a weighted density on R+. For usual choices of the response kernel, f(u; ξ) → 0 as u becomes large (i.e., response functions typically have a right tail that asymptotes at zero). Intuitively, this means that when t_k − t_j is large, the contribution of φ_j(t_k − t_j) to equation (6) will be negligible.

In order to make this idea more formal, consider the sets

    W_j = {k > j : f(t_k − t_j; ξ) > w}

for a tolerance w > 0, and let W̄ denote the average of the cardinalities of the W_j. Replacing the sum over k > j with the sum over k ∈ W_j in equation (6) results in W̄·n densities appearing in the double summation. This substitution will be referred to as the modified Q function and denoted Q̃. W̄ is the linear growth factor of Q̃. The relative efficiency of Q̃ over Q is

    R = W̄·n / (n(n − 1)/2) = 2W̄ / (n − 1).

The value of W̄ depends on (a) ξ, which is updated throughout the optimization process, (b) w, which can be determined by the researcher, and (c) the actual observations t_k, which are fixed. This makes it difficult to obtain analytical results on W̄. However, Table 1 provides evidence that it does not grow with n and that it can be much smaller than (n − 1)/2.
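For a monotonically decreasing kernel the sets W_j are simply moving windows, so they can be located without touching all n(n − 1)/2 pairs. The sketch below is an illustration for the exponential kernel, not code from the paper; it counts the retained pairs and reports W̄ and the relative efficiency R.

```python
import numpy as np

def truncation_summary(events, xi, w):
    """Count, for each event j, the later events k with
    f(t_k - t_j; xi) = xi * exp(-xi * (t_k - t_j)) > w.

    Because the exponential kernel is decreasing, f(u) > w is equivalent
    to u < log(xi / w) / xi, so each W_j is a contiguous window that can
    be found by binary search on the sorted event times.  Assumes w < xi,
    so that the window length is positive."""
    n = len(events)
    u_max = np.log(xi / w) / xi                  # lag beyond which f(u) <= w
    # first index at or past t_j + u_max, minus the j + 1 events up to t_j
    counts = np.searchsorted(events, events + u_max) - np.arange(1, n + 1)
    W_bar = counts.mean()                        # average cardinality of W_j
    R = 2.0 * W_bar / (n - 1)                    # efficiency of Q-tilde vs. Q
    return W_bar, R
```

For example, with ξ = 1 and w = 1e-10, u_max ≈ 23, so W̄ is roughly the expected number of events falling in a window of that fixed length; it depends on the local event rate but not on n, which is the linear growth behaviour reported in Table 1.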
=========================
Insert Table 1 about here
=========================

The table was produced by simulating data using the inverse method (see Daley & Vere-Jones, 2003). The causal filter in equation (2) was used for simulation, not the branching structure. Three different sample sizes (N = 500, 1500, and 5000) were simulated from each of three different models. Model 1 and Model 2 used exponential response functions, with Model 1 having moderate intensity (α = .4) and Model 2 having high intensity (α = .8). This means that the data from Model 2 showed a much higher degree of clustering (i.e., a larger number of events occurring in close proximity to one another). Model 3 was also high intensity (α = .8) but used a two-parameter gamma kernel with shape parameter set to .5. The result is heavier-tailed response functions, which have been reported in various applications to human communication data (e.g., Barabási, 2005; Crane & Sornette, 2008; Halpin & De Boeck, in press). The choices of intensity parameter are intended to reflect its possible range rather than realistic values; I have not seen intensity estimates greater than .5 in real data applications. For each simulated data set, Q̃ was computed using the true parameter values and w = 1e-10.

The main point to be taken from Table 1 is that the values of W̄ did not increase with n, and therefore the rate of growth of Q̃ was linear. The exact rate of linear growth depended on the parameters of the data generating model, with more clustered data showing faster growth. However, even at extraordinarily high intensities and even at the smallest sample size, the growth rate was much smaller than (n − 1)/2. Based on these results, it is reasonable to conclude that Q̃ is more efficient to compute than Q. It should be emphasized that this depends on the type of response kernel; the approach outlined here will not work unless the response kernel has a right tail that asymptotes at zero.

Table 1 does not address how the rounding error w affects the MLEs produced by the EM algorithm. That is the topic of the next section. Although this section has focussed only on the computation of the Q function, entirely similar remarks can be made about the computation of equation (7) on the E step, and about the computation of equation (5).

Effect of Rounding Error on the EM Algorithm

This section considers how the rounding error w affects convergence and parameter recovery.
Data were again simulated using the inverse method with the incomplete data model (equation (2)). The data-generating model used a two-parameter gamma density as the response kernel. The parameters of the data generating model are stated in Table 3 and were based on the real data example reported in Halpin and De Boeck (in press). N = 250 data sets of n = 500 time points each were generated from the model. For each data set, the EM algorithm described in Halpin and De Boeck (in press) was implemented using Q̃ in place of Q. The starting values for the estimation algorithm were obtained by randomly disturbing the data generating values, which avoided the need for multiple starting values. Convergence was evaluated using the incomplete data log-likelihood (equation (5)). The convergence criterion was an absolute difference of less than 1e-5 on subsequent M steps.

The simulation compared the rounding errors w = 0, 1e-10, 1e-5, and 1e-3. Because a rounding error of 0 is not possible in practice, this was implemented using w = 2.22e-16, the double precision machine epsilon on most modern computers. Therefore the value w = 0 represents the amount of error that is intrinsic to the specific realization of the estimation process (i.e., with the given sample size, convergence criterion, etc.). The remaining values of w represent the introduction of rounding error for computational efficiency.
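For concreteness, the outer loop of the estimation procedure can be sketched as follows, reusing the illustrative e_step, m_step, and hawkes_loglik functions from the earlier sketches (exponential kernel, whereas the study itself used a gamma kernel) together with the convergence rule just described. In the modified algorithm, the sums inside the E step would additionally be restricted to the truncated index sets W_j determined by w.

```python
import numpy as np

def em_fit(events, T, mu, alpha, xi, tol=1e-5, max_iter=1000):
    """Outer EM loop: iterate E and M steps until the incomplete-data
    log-likelihood (5) changes by less than tol between M steps."""
    ll_old = -np.inf
    for _ in range(max_iter):
        p = e_step(events, mu, alpha, xi)      # E step: probabilities (7)
        mu, alpha, xi = m_step(events, T, p)   # M step: maximize Q (or Q-tilde)
        ll = hawkes_loglik(events, T, mu, alpha, xi)
        if abs(ll - ll_old) < tol:             # criterion used in the study
            break
        ll_old = ll
    return mu, alpha, xi, ll
```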
Let us first consider the role of rounding error in the convergence of the algorithm. Figure 1 shows the relationship between the log-likelihoods evaluated at the MLEs and the log-likelihoods evaluated at the data generating parameters. The relation is quite similar for the three smallest values of w, but is appreciably worse for the largest value. It is important to note that even for w = 0, the relationship is not perfect. The amount of additional error introduced by the two middle values of w is not perceptible in the figure.

=========================
Insert Figure 1 about here
=========================

Table 2 provides a closer look at the log-likelihoods. It reports the mean and standard deviation of the differences between the log-likelihoods of the estimated models and the log-likelihoods computed using the true values. The table entries are reported as percentages of the difference between the log-likelihoods of w = 0 and of the true values (i.e., as percentages of the intrinsic estimation error). If w > 0 did not affect the convergence of the EM algorithm, all values in the table would be 100. Based on the table we can conclude that all values of w > 0 introduced additional error into the convergence of the EM algorithm. For w = 1e-10 this was less than .1 percent of the intrinsic estimation error.

=========================
Insert Table 2 about here
=========================

Turning now to parameter recovery, Table 3 reports the bias and error of the MLEs for each level of rounding error. The entries are reported as percentages of the data generating parameters. It can be seen that bias and error were very similar for the two lowest values of w, but for larger values of w there is increased bias and reduced error. Figure 2 shows the distribution of estimates of the gamma response kernels for w = 0 and w = 1e-10.

=========================
Insert Table 3 about here
=========================

=========================
Insert Figure 2 about here
=========================
Based on this simulation it may be concluded that there is little to distinguish the results obtained using a rounding error of w = 1e-10 from the intrinsic error in the algorithm (i.e., w = 0). On the other hand, w ≥ 1e-5 has a relatively large influence both on the convergence of the algorithm and on the bias and error of the resulting parameter estimates.

Conclusions

The number of computations required by the EM algorithm proposed by Halpin and De Boeck (in press) grows quadratically in the number of observed events, making its application to large data sets infeasible. This paper has shown that the runtime of the algorithm can be reduced by introducing rounding error into the computation of the Q function (i.e., the objective function of the M step of the EM algorithm). In three applications involving response functions with right tails asymptoting at zero, this was shown to result in linear growth. The consequences for convergence of the algorithm and for parameter recovery were also considered. A rounding error of 1e-10 was found to have negligible effects, but larger values did not. While more research can be done to optimize the rounding error for specific applications of the algorithm, it can be concluded that the approach presented here provides an acceptable compromise between runtime and computational accuracy.
References

Barabási, A. L. (2005). The origin of bursts and heavy tails in human dynamics. Nature, 435.

Crane, R., & Sornette, D. (2008). Robust dynamic classes revealed by measuring the response function of a social system. Proceedings of the National Academy of Sciences, 105.

Daley, D. J., & Vere-Jones, D. (2003). An introduction to the theory of point processes: Elementary theory and methods (2nd ed., Vol. 1). New York: Springer.

Halpin, P. F., & De Boeck, P. (in press). Modeling dyadic interaction using Hawkes process. Psychometrika.

Hawkes, A. G., & Oakes, D. (1974). A cluster representation of a self-exciting process. Journal of Applied Probability, 11.

McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: John Wiley and Sons.

Ogata, Y. (1988). Statistical models for earthquake occurrences and residual analysis for point processes. Journal of the American Statistical Association, 83.

Rasmussen, J. G. (2011). Bayesian inference for Hawkes processes. Methodology and Computing in Applied Probability.

Seber, G. A. F., & Wild, C. J. (2003). Non-linear regression (2nd ed.). Hoboken, NJ: John Wiley & Sons.

Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93.

Veen, A., & Schoenberg, F. P. (2008). Estimation of space-time branching process models in seismology using an EM-type algorithm. Journal of the American Statistical Association, 103.
Tables
Table 1.
Growth of the Q̃ Function in Number of Time Points (Simulated Data)

             n = 500    n = 1500    n = 5000
Model 1         –           –           –
Model 2         –           –           –
Model 3         –           –           –

Note: n is the number of simulated time points and the table entries are the linear growth factor, W̄, of the modified Q function, Q̃, computed using the true parameter values. W̄·n gives the number of computations required for Q̃ and 2W̄/(n − 1) gives the efficiency of Q̃ relative to the original Q function proposed by Halpin and De Boeck (in press). The models are described in the text.
Table 2.
Effect of Rounding Error on Log-likelihoods (Simulated Data)

          w = 0    w = 1e-10    w = 1e-5    w = 1e-3
Mean        –          –            –           –
SD          –          –            –           –

Note: Table entries are means (Mean) and standard deviations (SD) of the differences between the log-likelihoods of the estimated models and the log-likelihoods computed using the true values. The means and standard deviations are reported as percentages of the values for w = 0 (i.e., percentages of the intrinsic estimation error). The MLEs were obtained using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function and the indicated levels of rounding error, w.
Table 3.
Effect of Rounding Error on Parameter Recovery (Simulated Data)

                  µ              α              κ              β
True values       –              –              –              –
w = 0          – (12.707)    – (14.282)    – (11.812)    – (49.986)
w = 1e-10      – (12.725)    – (14.315)    – (11.824)    – (50.664)
w = 1e-5       – (10.857)    – (11.592)    – (11.215)    – (22.937)
w = 1e-3       – (9.969)     – (8.786)     – (17.114)    – (3.618)

Note: Table entries are bias (error) of maximum likelihood estimates (MLEs) as percentages of the true values. µ denotes the baseline parameter of Hawkes process, α the intensity parameter, κ the shape parameter of the two-parameter gamma response kernel, and β its scale parameter. MLEs were obtained using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function and the indicated levels of rounding error, w.
[Figure 1: four scatterplot panels, one for each of w = 0, w = 1e-10, w = 1e-5, and w = 1e-3; x-axis: MLEs, y-axis: True values; the correlation r = .998 is shown in the w = 0 and w = 1e-5 panels.]

Figure 1. Relation of log-likelihoods at convergence with log-likelihoods computed using the data generating values (simulated data). The model was estimated using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function presented in this paper and the indicated levels of rounding error, w.
[Figure 2: four histogram panels of MLEs of the shape (left column) and scale (right column) parameters, for w = 0 (top row) and w = 1e-10 (bottom row); y-axis: Frequency.]

Figure 2. Histograms of maximum likelihood estimates (MLEs) of the two-parameter gamma density kernel (simulated data). Bold vertical line indicates the value of the data generating parameters. MLEs were obtained using the EM algorithm described by Halpin and De Boeck (in press) with the modified Q function presented in this paper and the indicated levels of rounding error, w.
More information