Spike train entropy-rate estimation using hierarchical Dirichlet process priors


Published in: Advances in Neural Information Processing Systems 26 (2013).

Karin Knudson, Department of Mathematics
Jonathan W. Pillow, Center for Perceptual Systems, Departments of Psychology & Neuroscience
The University of Texas at Austin

Abstract

Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data.

1 Introduction

The problem of characterizing the statistical properties of a spiking neuron is quite general, but two interesting questions one might ask are: (1) what kind of time dependencies are present? and (2) how much information is the neuron transmitting? With regard to the second question, information theory provides quantifications of the amount of information transmitted by a signal without reference to assumptions about how the information is represented or used. The entropy rate is of interest as a measure of uncertainty per unit time, an upper bound on the rate of information transmission, and an intermediate step in computing the mutual information rate between stimulus and neural response. Unfortunately, accurate entropy rate estimation is difficult, and estimates from limited data are often severely biased.

We present a Bayesian method for estimating entropy rates from binary data that uses hierarchical Dirichlet process (HDP) priors to reduce this bias. Our method proceeds by modeling the source of the data as a Markov chain, and then using the fact that the entropy rate of a Markov chain is a deterministic function of its transition probabilities. Fitting the model yields parameters relevant to both questions (1) and (2) above: we obtain both an approximation of the underlying stochastic process as a Markov chain, and an estimate of the entropy rate of the process. For binary data, the HDP reduces to a hierarchy of beta priors, where the prior probability over g, the probability of the next symbol given a long history, is a beta distribution centered on the probability of that symbol given a truncated, one-symbol-shorter history. The posterior over symbols given a certain history is thus smoothed by the probability over symbols given a shorter history. This smoothing is a key feature of the model.

The structure of the paper is as follows. In Section 2, we present definitions and challenges involved in entropy rate estimation, and discuss existing estimators. In Section 3, we discuss Markov models and their relationship to entropy rate. In Sections 4 and 5, we present two Bayesian estimates of entropy rate using the HDP prior, one involving a direct calculation of the posterior mean transition probabilities of a Markov model, the other using Markov Chain Monte Carlo methods to sample from the posterior distribution of the entropy rate. In Section 6 we compare the HDP entropy rate estimators to existing entropy rate estimators, including the context tree weighting entropy rate estimator from [1], the string-parsing method from [2], and finite-length block entropy rate estimators that make use of the entropy estimators of Nemenman, Shafee and Bialek [3] and Miller and Madow [4]. We evaluate the results for simulated and real neural data.

2 Entropy Rate Estimation

In information theory, the entropy of a random variable is a measure of the variable's average unpredictability. The entropy of a discrete random variable X with possible values {x_1, ..., x_n} is

    H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)    (1)

Entropy can be measured in either bits or nats, depending on whether we use base 2 or base e for the logarithm. Here, all logarithms discussed will be base 2, and all entropies will be given in bits. While entropy is a property of a random variable, entropy rate is a property of a stochastic process, such as a time series, and quantifies the amount of uncertainty per symbol. The neural and simulated data considered here will be binary sequences representing the spike train of a neuron, where each symbol represents either the presence of a spike in a bin (1) or the absence of a spike (0). We view the data as a sample path from an underlying stochastic process. To evaluate the average uncertainty of each new symbol (0 or 1) given the previous symbols - or the amount of new information per symbol - we would like to compute the entropy rate of the process.

For a stochastic process {X_i}_{i=1}^{\infty}, the entropy of the random vector (X_1, ..., X_k) grows with k; we are interested in how it grows. If we define the block entropy H_k to be the entropy of the distribution of length-k sequences of symbols, H_k = H(X_{i+1}, ..., X_{i+k}), then the entropy rate of the stochastic process {X_i}_{i=1}^{\infty} is defined by

    h = \lim_{k \to \infty} \frac{1}{k} H_k    (2)

when the limit exists (which, for stationary stochastic processes, it must). There are two other definitions for entropy rate, which are equivalent to the first for stationary processes:

    h = \lim_{k \to \infty} H_{k+1} - H_k    (3)

    h = \lim_{k \to \infty} H(X_{i+1} | X_i, X_{i-1}, ..., X_{i-k})    (4)

We now briefly review existing entropy rate estimators, to which we will compare our results.

2.1 Block Entropy Rate Estimators

Since much work has been done to accurately estimate entropy from data, Equations (2) and (3) suggest a simple entropy rate estimator, which consists of choosing first a block size k and then a suitable entropy estimator with which to estimate H_k. A simple such estimator is the plugin entropy estimator, which approximates the probability of each length-k block (x_1, ..., x_k) by the proportion of total length-k blocks observed that are equal to (x_1, ..., x_k). For binary data there are 2^k possible length-k blocks. When N denotes the data length and c_i the number of observations of each block in the data, we have:

    \hat{H}_{plugin} = -\sum_{i=1}^{2^k} \frac{c_i}{N} \log \frac{c_i}{N}    (5)

from which we can immediately estimate the entropy rate with \hat{h}_{plugin,k} = \hat{H}_{plugin}/k, for some appropriately chosen k (the subject of appropriate choice will be taken up in more detail later).
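To make the procedure concrete, the following is a minimal sketch (our own illustration, not the authors' code) of the plugin block entropy rate estimator of Equation (5): count the length-k blocks of a binary sequence, form the empirical block distribution, and divide the resulting block entropy by k. The function name and the use of overlapping blocks are assumptions made for this example.

```python
from collections import Counter
import numpy as np

def plugin_entropy_rate(spikes, k):
    """Plugin entropy rate estimate (bits per symbol) from a binary sequence."""
    spikes = np.asarray(spikes)
    n_blocks = len(spikes) - k + 1
    # empirical distribution over the observed length-k blocks
    counts = Counter(tuple(spikes[i:i + k]) for i in range(n_blocks))
    p = np.array(list(counts.values()), dtype=float) / n_blocks
    H_k = -np.sum(p * np.log2(p))   # block entropy, Equation (5)
    return H_k / k                  # entropy rate estimate h_plugin,k

# Example: i.i.d. fair-coin spikes have true entropy rate 1 bit/symbol,
# but a short sequence with a large k yields a downward-biased estimate.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=2000)
print(plugin_entropy_rate(x, k=8))
```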

We would expect that using better block entropy estimators would yield better entropy rate estimators, and so we also consider two other block-based entropy rate estimators. The first uses the Bayesian entropy estimator \hat{H}_{NSB} from Nemenman, Shafee and Bialek [3], which gives a Bayes least squares estimate for entropy given a mixture-of-Dirichlet prior. The second uses the Miller and Madow estimator [4], which gives a first-order correction to the (often significantly biased) plugin entropy estimator of Equation 5:

    \hat{H}_{MM} = -\sum_{i=1}^{2^k} \frac{c_i}{N} \log \frac{c_i}{N} + \frac{A - 1}{2N} \log(e)    (6)

where A is the size of the alphabet of symbols (A = 2 for the binary data sequences presently considered). For a given k, we obtain entropy rate estimators \hat{h}_{NSB,k} = \hat{H}_{NSB}/k and \hat{h}_{MM,k} = \hat{H}_{MM}/k by applying the entropy estimators from [3] and [4] respectively to the empirical distribution of the length-k blocks.

While we can improve the accuracy of these block entropy rate estimates by choosing a better entropy estimator, choosing the block size k remains a challenge. If we choose k to be small, we miss long time dependencies in the data and tend to overestimate the entropy; intuitively, the time series will seem more unpredictable than it actually is, because we are ignoring long-time dependencies. On the other hand, as we consider larger k, limited data leads to underestimates of the entropy rate. See the plots of \hat{h}_{plugin}, \hat{h}_{NSB}, and \hat{h}_{MM} in Figure 2d for an instance of this effect of block size on entropy rate estimates. We might hope that in between the overestimates of entropy rate for short blocks and the underestimates for longer blocks, there is some plateau region where the entropy rate stays relatively constant with respect to block size, which we could use as a heuristic to select the proper block length [1]. Unfortunately, the block entropy rate at this plateau may still be biased, and for data sequences that are short with respect to their time dependencies, there may be no discernible plateau at all ([1], Figure 1).

2.2 Other Entropy Rate Estimators

Not all existing techniques for entropy rate estimation involve an explicit choice of block length. The estimator from [2], for example, parses the full string of symbols in the data by starting from the first symbol, and sequentially removing and counting as a phrase the shortest substring that has not yet appeared. When M is the number of distinct phrases counted in this way, we obtain the estimator \hat{h}_{LZ} = (M/N) \log N, free from any explicit block length parameters.
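A rough sketch of this string-parsing estimator, as we read the description above (this is our own illustration of the parsing rule, not code from [2]):

```python
import numpy as np

def lz_entropy_rate(spikes):
    """String-parsing entropy rate estimate h_LZ = (M/N) log2 N (bits/symbol)."""
    s = ''.join(str(int(b)) for b in spikes)
    N = len(s)
    phrases, i = set(), 0
    while i < N:
        j = i + 1
        # grow the candidate phrase until it has not been seen before
        while j <= N and s[i:j] in phrases:
            j += 1
        phrases.add(s[i:j])   # remove and count the new phrase
        i = j
    M = len(phrases)
    return (M / N) * np.log2(N)

rng = np.random.default_rng(1)
print(lz_entropy_rate(rng.integers(0, 2, size=5000)))  # near 1 bit/symbol
```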
A fixed block length model like the ones described in the previous section uses the entropy of the distribution of all the blocks of some length - e.g. all the blocks in the terminal nodes of a context tree like the one in Figure 1a. In the context tree weighting (CTW) framework of [1], the authors instead use a minimum descriptive length criterion to weight different tree topologies, which have within the same tree terminal nodes corresponding to blocks of different lengths. They use this weighting to generate Monte Carlo samples and approximate the integral \int \int h(\theta) p(\theta | T, data) p(T | data) \, d\theta \, dT, in which T represents the tree topology and \theta represents the transition probabilities associated with the terminal nodes of the tree.

In our approach, the HDP prior combined with a Markov model of our data will be a key tool in overcoming some of the difficulties of choosing a block length appropriately for entropy rate estimation. It will allow us to choose a block length that is large enough to capture possibly important long time dependencies, while easing the difficulty of estimating the properties of these long time dependencies from short data.

3 Markov Models

The usefulness of approximating our data source with a Markov model comes from (1) the flexibility of Markov models, including their ability to approximate well even many processes that are not truly Markovian, and (2) the fact that for a Markov chain with known transition probabilities the entropy rate need not be estimated but is in fact a deterministic function of the transition probabilities.

Figure 1: A depth-3 hierarchical Dirichlet prior for binary data.

A Markov chain is a sequence of random variables that has the property that the probability of the next state depends only on the present state, and not on any previous states. That is, P(X_{i+1} | X_i, ..., X_1) = P(X_{i+1} | X_i). Note that this property does not mean that for a binary sequence the probability of each 0 or 1 depends only on the previous 0 or 1, because we consider the state variables to be strings of symbols of length k rather than individual 0s and 1s. Thus we will discuss depth-k Markov models, where the probability of the next state depends only on the previous k symbols, or what we will call the length-k context of the symbol. With a binary alphabet, there are 2^k states the chain can take, and from each state s, transitions are possible only to two other states (so that, for example, the state 011 can transition to state 110 or state 111, but not to any other state). Because only two transitions are possible from each state, the transition probability distribution from each state s is completely specified by only one parameter, which we denote g_s, the probability of observing a 1 given the context s.

The entropy rate of an ergodic Markov chain with finite state set A is given by:

    h = \sum_{s \in A} p(s) H(x|s)    (7)

where p(s) is the stationary probability associated with state s, and H(x|s) is the entropy of the distribution of possible transitions from state s. The vector of stationary state probabilities p(s) for all s is computed as a left eigenvector of the transition matrix T:

    p(s) T = p(s),    \sum_s p(s) = 1    (8)

Since each row of the transition matrix T contains only two non-zero entries, g_s and 1 - g_s, p(s) can be calculated relatively quickly. With Equations 7 and 8, h can be calculated analytically from the vector of all 2^k transition probabilities {g_s}. A Bayesian estimator of entropy rate based on a Markov model of order k is given by

    \hat{h}_{Bayes} = \int h(g) p(g | data) \, dg    (9)

where g = {g_s : |s| = k}, h is the deterministic function of g given by Equations 7 and 8, and p(g | data) \propto p(data | g) p(g) given some appropriate prior over g. Modeling a time series as a Markov chain requires a choice of the depth of that chain, so we have not avoided the depth selection problem yet. What will actually mitigate the difficulty here is the use of hierarchical Dirichlet process priors.
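As a brief illustration of Equations (7) and (8), here is a sketch (our own, with an assumed bit ordering in which the integer encoding of a context has the most recent symbol as its least significant bit) of computing the entropy rate of a binary depth-k Markov chain from the vector of transition probabilities {g_s}.

```python
import numpy as np

def markov_entropy_rate(g, k):
    """Entropy rate (bits/symbol) of a binary depth-k Markov chain, Eqs. (7)-(8).

    g[s] = P(next symbol is 1 | context s), with contexts encoded as integers
    whose least significant bit is the most recent symbol (our convention).
    """
    n = 2 ** k
    T = np.zeros((n, n))
    for s in range(n):
        next0 = (s << 1) & (n - 1)   # context after observing a 0
        next1 = next0 | 1            # context after observing a 1
        T[s, next1] = g[s]
        T[s, next0] = 1.0 - g[s]
    # stationary distribution: left eigenvector of T with eigenvalue 1
    evals, evecs = np.linalg.eig(T.T)
    p = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    p = p / p.sum()
    # per-state transition entropies H(x|s), guarding against log(0)
    g = np.asarray(g, dtype=float)
    with np.errstate(divide='ignore', invalid='ignore'):
        Hs = -(np.where(g > 0, g * np.log2(g), 0.0)
               + np.where(g < 1, (1 - g) * np.log2(1 - g), 0.0))
    return float(np.dot(p, Hs))   # Equation (7)

# Example: a depth-1 "sticky" chain; its rate is well below 1 bit/symbol.
print(markov_entropy_rate([0.1, 0.9], k=1))
```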

4 Hierarchical Dirichlet Process Priors

We describe a hierarchical beta prior, a special case of the hierarchical Dirichlet process (HDP), which was presented in [5] and applied to problems of natural language processing in [6] and [7].

The true entropy rate h = \lim_{k \to \infty} H_k / k captures time dependencies of infinite depth. Therefore, to calculate the estimate \hat{h}_{Bayes} in Equation 9 we would like to choose some large k. However, it is difficult to estimate transition probabilities for long blocks with short data sequences, so choosing large k may lead to inaccurate posterior estimates for the transition probabilities g. In particular, shorter data sequences may not even have observations of all possible symbol sequences of a given length. This motivates our use of hierarchical priors as follows.

Suppose we have a data sequence in which a particular subsequence, say 0110, is never observed. Then we would not expect to have a very good estimate for g_{0110}; however, we could improve this by using the assumption that, a priori, g_{0110} should be similar to g_{110}. That is, the probability of observing a 1 after the context sequence 0110 should be similar to that of seeing a 1 after 110, since it might be reasonable to assume that context symbols from the more distant past matter less. Thus we choose for our prior:

    g_s | g_{s'} ~ Beta(\alpha_{|s|} g_{s'}, \alpha_{|s|} (1 - g_{s'}))    (10)

where s' denotes the context s with the earliest symbol removed. This choice gives the prior distribution of g_s mean g_{s'}, as desired. We continue constructing the prior with g_{s'} | g_{s''} ~ Beta(\alpha_{|s'|} g_{s''}, \alpha_{|s'|} (1 - g_{s''})), and so on until g_{[]} ~ Beta(\alpha_0 p_0, \alpha_0 (1 - p_0)), where g_{[]} is the probability of a spike given no context information and p_0 is a hyperparameter reflecting our prior belief about the probability of a spike. This hierarchy gives our prior the tree structure shown in Figure 1. A priori, the distribution of each transition probability is centered around the transition probability from a one-symbol-shorter block of symbols. As long as the assumption that more distant contextual symbols matter less actually holds (at least to some degree), this structure allows the sharing of statistical information across different contextual depths. We can obtain reasonable estimates for the transition probabilities from long blocks of symbols, even from data that is so short that we may have few (or no) observations of each of these long blocks of symbols.

We could use any number of distributions with mean g_{s'} to center the prior distribution of g_s at g_{s'}; we use beta distributions because they are conjugate to the likelihood. The \alpha's are concentration parameters which control how concentrated the distribution is about its mean, and they can also be estimated from the data. We assume that there is one value of \alpha for each level in the hierarchy, but one could also fix \alpha to be constant throughout all levels, or let it vary within each level.

This hierarchy of beta distributions is a special case of the hierarchical Dirichlet process. A Dirichlet process (DP) is a stochastic process whose sample paths are each probability distributions. Formally, if G is a finite measure on a set S, then X ~ DP(\alpha, G) if for any finite measurable partition (A_1, ..., A_n) of the sample space we have that (X(A_1), ..., X(A_n)) ~ Dirichlet(\alpha G(A_1), ..., \alpha G(A_n)). Thus, for a partition into only two sets, the Dirichlet process reduces to a beta distribution, which is why when we specialize the HDP to binary data, we obtain a hierarchical beta distribution. In [5] the authors present a hierarchy of DPs where the base measure for each DP is again a DP. In our case we have G_s ~ DP(\alpha_{|s|}, G_{s'}), where G_s = {g_s, 1 - g_s}.

5 Empirical Bayesian Estimator

One can generate a sequence from an HDP by drawing each subsequent symbol from the transition probability distribution associated with its context, which is given recursively by [6]:

    p(1|s) = (c_{s1} + \alpha_{|s|} p(1|s')) / (c_s + \alpha_{|s|})    if s \neq []
    p(1|s) = (c_1 + \alpha_0 p_0) / (N + \alpha_0)                     if s = []    (11)

where N is the length of the data string, p_0 is a hyperparameter representing the a priori probability of observing a 1 given no contextual information, c_{s1} is the number of times the symbol sequence s followed by a 1 was observed, and c_s is the number of times the symbol sequence s was observed.
We can calculate the posterior predictive distribution \hat{g}_{pr}, which is specified by the 2^k values {g_s = p(1|s) : |s| = k}, by using counts c from the data and performing the above recursive calculation to estimate g_s for each of the 2^k states s. Given the estimated Markov transition probabilities \hat{g}_{pr}, we then have an empirical Bayesian entropy rate estimate via Equations 7 and 8. We denote this estimator \hat{h}_{empHDP}. Note that while \hat{g}_{pr} is the posterior mean of the transition probabilities, the entropy rate estimator \hat{h}_{empHDP} is no longer a fully Bayesian estimate, and is not equivalent to the \hat{h}_{Bayes} of Equation 9. We thus lose some clarity and the ability to easily compute Bayesian confidence intervals. However, we gain a good deal of computational efficiency, because calculating \hat{h}_{empHDP} from \hat{g}_{pr} involves only one eigenvector computation, instead of the many needed for the Monte Carlo approximation to the integral in Equation 9. We present a fully Bayesian estimate next.
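The recursion in Equation (11) is easy to implement directly. Below is a schematic sketch (our own; the context-counting helper, the per-level alpha values, and p_0 = 0.5 are assumptions made for illustration) of computing \hat{g}_{pr} for all depth-k contexts, which could then be fed into an entropy rate calculation via Equations (7) and (8), such as the markov_entropy_rate sketch above.

```python
import numpy as np

def count_contexts(spikes, max_depth):
    """counts[s] = [times context s occurred, times s was followed by a 1]."""
    counts = {}
    for t in range(len(spikes)):
        for d in range(max_depth + 1):
            if t - d < 0:
                break
            s = ''.join(str(int(b)) for b in spikes[t - d:t])  # length-d context
            c = counts.setdefault(s, [0, 0])
            c[0] += 1
            c[1] += int(spikes[t])
    return counts

def predictive(s, counts, alpha, p0):
    """Recursive posterior predictive p(1 | context s), Equation (11)."""
    if s == '':
        c_tot, c_one = counts.get('', (0, 0))
        return (c_one + alpha[0] * p0) / (c_tot + alpha[0])
    c_tot, c_one = counts.get(s, (0, 0))
    parent = predictive(s[1:], counts, alpha, p0)   # drop the earliest symbol
    a = alpha[len(s)]
    return (c_one + a * parent) / (c_tot + a)

# Empirical-Bayes transition probabilities g_pr for all 2**k depth-k contexts.
k, alpha, p0 = 3, [10.0] * 10, 0.5            # assumed hyperparameter values
rng = np.random.default_rng(2)
spikes = rng.integers(0, 2, size=500)
counts = count_contexts(spikes, k)
g_pr = {format(i, '0{}b'.format(k)): None for i in range(2 ** k)}
for s in g_pr:
    g_pr[s] = predictive(s, counts, alpha, p0)
print(g_pr)
```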

6 Fully Bayesian Estimator

Here we return to the Bayes least squares estimator \hat{h}_{Bayes} of Equation 9. The integral is not analytically tractable, but we can approximate it using Markov Chain Monte Carlo techniques. We use Gibbs sampling to simulate N_{MC} samples g^{(i)} from the posterior distribution g | data, and then calculate h^{(i)} from each g^{(i)} via Equations 7 and 8 to obtain the Bayesian estimate:

    \hat{h}_{HDP} = \frac{1}{N_{MC}} \sum_{i=1}^{N_{MC}} h^{(i)}    (12)

To perform the Gibbs sampling, we need the posterior conditional probabilities of each g_s. Because the parameters of the model have the structure of a tree, each g_s for |s| < k is conditionally independent from all but its immediate ancestor in the tree, g_{s'}, and its two descendants, g_{0s} and g_{1s}. We have:

    p(g_s | g_{s'}, g_{0s}, g_{1s}, \alpha_{|s|}, \alpha_{|s|+1}) \propto Beta(g_s; \alpha_{|s|} g_{s'}, \alpha_{|s|}(1 - g_{s'})) \, Beta(g_{0s}; \alpha_{|s|+1} g_s, \alpha_{|s|+1}(1 - g_s)) \, Beta(g_{1s}; \alpha_{|s|+1} g_s, \alpha_{|s|+1}(1 - g_s))    (13)

and we can compute these probabilities on a discrete grid since they are each one dimensional, then sample the posterior g_s via this grid. We used a uniform grid of points on the interval [0, 1] for our computation. For the transition probabilities from the bottom level of the tree, {g_s : |s| = k}, the conjugacy of the beta distributions with the binomial likelihood function gives the posterior conditional of g_s a recognizable form: p(g_s | g_{s'}, data) = Beta(\alpha_k g_{s'} + c_{s1}, \alpha_k (1 - g_{s'}) + c_{s0}).

In the HDP model we may treat each \alpha as a fixed hyperparameter, but it is also straightforward to set a prior over each \alpha and then sample \alpha along with the other model parameters with each pass of the Gibbs sampler. The full posterior conditional for \alpha_i with a uniform prior is (from Bayes' theorem):

    p(\alpha_i | {g_s, g_{0s}, g_{1s} : |s| = i - 1}) \propto \prod_{s : |s| = i-1} \frac{(g_{0s} g_{1s})^{\alpha_i g_s} ((1 - g_{0s})(1 - g_{1s}))^{\alpha_i (1 - g_s)}}{B(\alpha_i g_s, \alpha_i (1 - g_s))^2}    (14)

where B denotes the beta function. We sampled \alpha by computing the probabilities above on a grid of values spanning a bounded range. The upper bound on \alpha is rather arbitrary, but we verified that increasing the range for \alpha had little effect on the entropy rate estimate, at least for the ranges and block sizes considered.

In some applications, the Markov transition probabilities g, and not just the entropy rate, may be of interest as a description of the time dependencies present in the data. The Gibbs sampler above yields samples from the distribution g | data, and averaging these N_{MC} samples yields a Bayes least squares estimator of the transition probabilities, \hat{g}_{gibbsHDP}. Note that this estimate is closely related to the estimate \hat{g}_{pr} from the previous section; with more MC samples, \hat{g}_{gibbsHDP} converges to the posterior mean \hat{g}_{pr} (when the \alpha are fixed rather than sampled, to match the fixed \alpha per level used in Equation 11).
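As a rough illustration of the grid-based conditional updates described above (this is our own bare-bones sketch, not the authors' sampler; the grid resolution and hyperparameter values are arbitrary assumptions), one Gibbs update for an internal node g_s and one for a leaf node might look like the following.

```python
import numpy as np
from scipy.stats import beta

def gibbs_update_internal(g_parent, g_child0, g_child1, a_this, a_next,
                          n_grid=100, rng=None):
    """Sample g_s from its grid-discretized conditional, Equation (13)."""
    rng = rng or np.random.default_rng()
    grid = np.linspace(1e-3, 1 - 1e-3, n_grid)
    logp = (beta.logpdf(grid, a_this * g_parent, a_this * (1 - g_parent))
            + beta.logpdf(g_child0, a_next * grid, a_next * (1 - grid))
            + beta.logpdf(g_child1, a_next * grid, a_next * (1 - grid)))
    w = np.exp(logp - logp.max())
    return rng.choice(grid, p=w / w.sum())

def gibbs_update_leaf(g_parent, c_s1, c_s0, a_k, rng=None):
    """Conjugate update for a bottom-level g_s (|s| = k)."""
    rng = rng or np.random.default_rng()
    return rng.beta(a_k * g_parent + c_s1, a_k * (1 - g_parent) + c_s0)

# One update of an internal node and one of a leaf, with assumed values.
print(gibbs_update_internal(0.4, 0.35, 0.5, a_this=20.0, a_next=20.0))
print(gibbs_update_leaf(0.4, c_s1=3, c_s0=7, a_k=20.0))
```

A full sampler would sweep such updates over every node of the tree (and optionally over the alphas), computing an entropy rate from each sweep via Equations (7) and (8) and averaging as in Equation (12).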

7 Results

We applied the model to both simulated data with a known entropy rate and to neural data, where the entropy rate is unknown. We examine the accuracy of the fully Bayesian and empirical Bayesian entropy rate estimators \hat{h}_{HDP} and \hat{h}_{empHDP}, and compare them to the entropy rate estimators \hat{h}_{plugin}, \hat{h}_{NSB}, \hat{h}_{MM}, \hat{h}_{LZ} [2], and \hat{h}_{CTW} [1], which are described in Section 2. We also consider estimates of the Markov transition probabilities g produced by both inference methods.

7.1 Simulation

We considered data simulated from a Markov model with transition probabilities set so that transition probabilities from states with similar suffixes are similar (i.e. the process actually does have the property that more distant context symbols matter less than more recent ones in determining transitions). We used a depth-5 Markov model, whose true transition probabilities are shown in black in Figure 2a.

[Figure 2: panels show (a) estimated transition probabilities p(1|s), (b) estimated entropy rate versus data length, (c) absolute error versus data length, and (d) estimated entropy rate versus block length, for the true rate and the NSB, MM, plugin, LZ, CTW, empHDP, and HDP estimators.]

Figure 2: Comparison of estimated (a) transition probability and (b,c,d) entropy rate for data simulated from a Markov model of depth 5. In (a) and (d), data sets are 500 symbols long. The block-based and HDP estimators in (b) and (c) use block size k = 8. In (b,c,d) results were averaged over 5 data sequences, and (c) plots the average absolute value of the difference between true and estimated entropy rates.

In Figure 2a, each of the 32 points on the x axis represents the true probability that the next symbol is a 1 given the specified 5-symbol context. We compare HDP estimates of the transition probabilities of this simulated data to the plugin estimator of transition probabilities, \hat{g}_s = c_{s1}/c_s, calculated from a 500-symbol sequence. (The other estimators do not include calculating transition probabilities as an intermediate step, and so cannot be included here.) With a series of 500 symbols, we do not expect enough observations of each of the possible transitions to adequately estimate the 2^k transition probabilities, even for rather modest depths such as k = 5. And indeed, the plugin estimates of transition probabilities do not match the true transition probabilities well. On the other hand, the transition probabilities estimated using the HDP prior show the kind of smoothing the prior was meant to encourage, where states corresponding to contexts with the same suffixes have similar estimated transition probabilities.

Lastly, we plot the convergence of the entropy rate estimators with increased length of the data sequence, and the associated error, in Figures 2b,c. If the true depth of the model is no larger than the depth k considered in the estimators, all the estimators considered should converge. We see in Figure 2c that the HDP-based entropy rate estimates converge quickly with increasing data, relative to other models. The motivation of the hierarchical prior was to allow observations of transitions from shorter contexts to inform estimates of transitions from longer contexts. This, it was hoped, would mitigate the drop-off with larger block size seen in block-entropy-based entropy rate estimators. Figure 2d indicates that for simulated data this is indeed the case, although we do see some bias in the fully Bayesian entropy rate estimator for large block lengths. The empirical Bayes and fully Bayesian entropy rate estimators with HDP priors produce estimates that are close to the true entropy rate across a wider range of block sizes.

7.2 Neural Data

We applied the same analysis to neural spike train data collected from primate retinal ganglion cells stimulated with binary full-field movies refreshed at 100 Hz [8]. In this case, the true transition probabilities are unknown (and indeed the process may not be exactly Markovian). However, we calculate the plug-in transition probabilities from a longer data sequence (67,000 bins) so that the estimates are approximately converged (black trace in Figure 3a), and note that transition probabilities from contexts with the same most-recent context symbols do appear to be similar. Thus the estimated transition probabilities reflect the idea that more distant context cues matter less, and the smoothing of the HDP prior appears to be appropriate for this neural data. The true entropy rate is also unknown, but again we estimate it using the plugin estimator on a large data set.
We again note the relatively fast convergence of \hat{h}_{HDP} and \hat{h}_{empHDP} in Figures 3b,c, and the long plateau of the estimators in Figure 3d, indicating the relative stability of the HDP entropy rate estimators with respect to the choice of model depth.

[Figure 3: panels show (a) estimated transition probabilities p(1|s), (b) estimated entropy rate versus data length, (c) mean absolute error versus data length, and (d) estimated entropy rate versus block length, for the converged estimate and the NSB, MM, plugin, LZ, CTW, empHDP, and HDP estimators.]

Figure 3: Comparison of estimated (a) transition probability and (b,c,d) entropy rate for neural data. The converged estimates are calculated from a long stretch of data with 4 ms bins (67,000 symbols). In (a) and (d), training data sequences are 500 symbols (2 s) long. The block-based and HDP estimators in (b) and (c) use block size k = 8. In (b,c,d), results were averaged over 5 data sequences sampled randomly from the full dataset.

8 Discussion

We have presented two estimators of the entropy rate of a spike train or arbitrary binary sequence. The true entropy rate of a stochastic process involves consideration of infinitely long time dependencies. To make entropy rate estimation tractable, one can try to fix a maximum depth of time dependencies to be considered, but it is difficult to choose an appropriate depth that is large enough to take into account long time dependencies and small enough relative to the data at hand to avoid a severe downward bias of the estimate. We have approached this problem by modeling the data as a Markov chain and estimating transition probabilities using a hierarchical prior that links transition probabilities from longer contexts to transition probabilities from shorter contexts. This allowed us to choose a large depth even in the presence of limited data, since the structure of the prior allowed observations of transitions from shorter contexts (of which we have many instances in the data) to inform estimates of transitions from longer contexts (of which we may have only a few instances).

We presented both a fully Bayesian estimator, which allows for Bayesian confidence intervals, and an empirical Bayesian estimator, which provides computational efficiency. Both estimators show excellent performance on simulated and neural data in terms of their robustness to the choice of model depth, their accuracy on short data sequences, and their convergence with increased data. Both methods of entropy rate estimation also yield estimates of the transition probabilities when the data is modeled as a Markov chain, parameters which may be of interest in their own right as descriptive of the statistical structure and time dependencies in a spike train. Our results indicate that tools from modern Bayesian nonparametric statistics hold great promise for revealing the structure of neural spike trains despite the challenges of limited data.

Acknowledgments

We thank V. J. Uzzell and E. J. Chichilnisky for retinal data. This work was supported by a Sloan Research Fellowship, McKnight Scholar's Award, and NSF CAREER Award IIS.

References

[1] Matthew B. Kennel, Jonathon Shlens, Henry D. I. Abarbanel, and E. J. Chichilnisky. Estimating entropy rates with Bayesian confidence intervals. Neural Computation, 17(7):1531-1576, 2005.

[2] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75-81, 1976.

[3] Ilya Nemenman, Fariel Shafee, and William Bialek. Entropy and inference, revisited. arXiv preprint physics/0108025, 2001.

[4] George Armitage Miller and William Gregory Madow. On the Maximum Likelihood Estimate of the Shannon-Wiener Measure of Information. Operational Applications Laboratory, Air Force Cambridge Research Center, Air Research and Development Command, Bolling Air Force Base, 1954.

[5] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.

[6] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.

[7] Frank Wood, Cédric Archambeau, Jan Gasthaus, Lancelot James, and Yee Whye Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.

[8] V. J. Uzzell and E. J. Chichilnisky. Precision of spike trains in primate retinal ganglion cells. Journal of Neurophysiology, 92:780-789, 2004.


More information

Generic maximum nullity of a graph

Generic maximum nullity of a graph Generic maximum nullity of a grap Leslie Hogben Bryan Sader Marc 5, 2008 Abstract For a grap G of order n, te maximum nullity of G is defined to be te largest possible nullity over all real symmetric n

More information

Chapter 2 Ising Model for Ferromagnetism

Chapter 2 Ising Model for Ferromagnetism Capter Ising Model for Ferromagnetism Abstract Tis capter presents te Ising model for ferromagnetism, wic is a standard simple model of a pase transition. Using te approximation of mean-field teory, te

More information

MAT244 - Ordinary Di erential Equations - Summer 2016 Assignment 2 Due: July 20, 2016

MAT244 - Ordinary Di erential Equations - Summer 2016 Assignment 2 Due: July 20, 2016 MAT244 - Ordinary Di erential Equations - Summer 206 Assignment 2 Due: July 20, 206 Full Name: Student #: Last First Indicate wic Tutorial Section you attend by filling in te appropriate circle: Tut 0

More information

LEAST-SQUARES FINITE ELEMENT APPROXIMATIONS TO SOLUTIONS OF INTERFACE PROBLEMS

LEAST-SQUARES FINITE ELEMENT APPROXIMATIONS TO SOLUTIONS OF INTERFACE PROBLEMS SIAM J. NUMER. ANAL. c 998 Society for Industrial Applied Matematics Vol. 35, No., pp. 393 405, February 998 00 LEAST-SQUARES FINITE ELEMENT APPROXIMATIONS TO SOLUTIONS OF INTERFACE PROBLEMS YANZHAO CAO

More information

Math 31A Discussion Notes Week 4 October 20 and October 22, 2015

Math 31A Discussion Notes Week 4 October 20 and October 22, 2015 Mat 3A Discussion Notes Week 4 October 20 and October 22, 205 To prepare for te first midterm, we ll spend tis week working eamples resembling te various problems you ve seen so far tis term. In tese notes

More information

Chapter 5 FINITE DIFFERENCE METHOD (FDM)

Chapter 5 FINITE DIFFERENCE METHOD (FDM) MEE7 Computer Modeling Tecniques in Engineering Capter 5 FINITE DIFFERENCE METHOD (FDM) 5. Introduction to FDM Te finite difference tecniques are based upon approximations wic permit replacing differential

More information

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS

EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Statistica Sinica 24 2014, 395-414 doi:ttp://dx.doi.org/10.5705/ss.2012.064 EFFICIENCY OF MODEL-ASSISTED REGRESSION ESTIMATORS IN SAMPLE SURVEYS Jun Sao 1,2 and Seng Wang 3 1 East Cina Normal University,

More information

Material for Difference Quotient

Material for Difference Quotient Material for Difference Quotient Prepared by Stepanie Quintal, graduate student and Marvin Stick, professor Dept. of Matematical Sciences, UMass Lowell Summer 05 Preface Te following difference quotient

More information

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning Long Term Time Series Prediction wit Multi-Input Multi-Output Local Learning Gianluca Bontempi Macine Learning Group, Département d Informatique Faculté des Sciences, ULB, Université Libre de Bruxelles

More information

Optimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems

Optimal parameters for a hierarchical grid data structure for contact detection in arbitrarily polydisperse particle systems Comp. Part. Mec. 04) :357 37 DOI 0.007/s4057-04-000-9 Optimal parameters for a ierarcical grid data structure for contact detection in arbitrarily polydisperse particle systems Dinant Krijgsman Vitaliy

More information