Appendix. Bayesian Information Criterion (BIC) E Proof of Proposition 1. Compositional Kernel Search Algorithm. Base Kernels.
|
|
- Florence Ray
- 5 years ago
- Views:
Transcription
1 Hyunjik Ki and Yee Whye Teh Appendix A Bayesian Inforation Criterion (BIC) The BIC is a odel selection criterion that is the arginal likelihood with a odel coplexity penalty: BIC = log p(y ˆθ) 1 2 p log(n) for observations y, nuber of observations N, axiu likelihood estiate (MLE) of odel hyperparaeters ˆθ, nuber of hyperparaeters p. It is derived as an approxiation to the log odel evidence log p(y). B Copositional Kernel Search Algorith Algorith 2: Copositional Kernel Search Algorith Input: data x 1,..., x n R D, y 1,..., y n R,base kernel set B Output: k, the resulting kernel For each base kernel on each diension, fit GP to data (i.e. optiise hyperparas by ML-II) and set k to be kernel with highest BIC. for depth=1:t (either fix T or repeat until BIC no longer increases) do Fit GP to following kernels and set k to be the one with highest BIC: (1) All kernels of for k + B where B is any base kernel on any diension (2) All kernels of for k B where B is any base kernel on any diension (3) All kernels where a base kernel in k is replaced by another base kernel C D Base Kernels LIN(x, x ) = σ 2 (x l)(x l) ( SE(x, x ) = σ 2 exp (x x ) 2 ) 2l 2 ( PER(x, x ) = σ 2 exp 2 sin2 (π(x x )/p) ) l 2 Matrix Identities Lea 1 (Woodbury s Matrix Inversion Lea). (A + UBV ) 1 = A 1 A 1 U(B 1 + V A 1 U) 1 V A 1 So setting A = σ 2 I (Nyströ) or σ 2 I + diag(k ˆK) (FIC) or σ 2 I + blockdiag(k ˆK) (PIC), U = Φ T = V, B = I, we get: (A + Φ Φ) 1 = A 1 A 1 Φ (I + ΦA 1 Φ ) 1 ΦA 1 Lea 2 (Sylvester s Deterinant Theore). det(i + AB) = det(i + BA) A R n B R n Hence: det(σ 2 I + Φ Φ) = (σ 2 ) n det(i + σ 2 Φ Φ) = (σ 2 ) n det(i + σ 2 ΦΦ ) = (σ 2 ) n det(σ 2 I + ΦΦ ) E Proof of Proposition 1 Proof. If PCG converges, the upper bound for NIP is. We showed in Section 4.1 that the convergence happened in only a few iterations. Moreover Cortes et al [7] shows that the lower bound for NIP can be rather loose in general. So it suffices to prove that the upper bound for NLD is tighter than the lower bound for NLD. Let (λ i ) N, (ˆλ i ) N be the ordered eigenvalues of K + σ 2 I, ˆK + σ 2 I respectively. Since K ˆK is positive sei-definite (e.g. [3]), we have λ i ˆλ i 2σ 2 i (using the assuption in the proposition). Now the slack in the upper bound is: 1 2 log det( ˆK + σ 2 I) ( 1 2 log det(k + σ2 I)) = 1 2 (log λ i log ˆλ i ) Hence the slack in the lower bound is: 1 2 log det(k + σ2 I) [ 1 2 log det( ˆK + σ 2 I) 1 ] Tr(K ˆK) 2σ2 = 1 (log λ i log 2 ˆλ i ) + 1 2σ 2 (λ i ˆλ i ) Now by concavity and onotonicity of log, and since ˆλ 2σ 2, we have: 1 2 log λ i log ˆλ i λ i ˆλ i 1 2σ 2 (log λ i log ˆλ i ) 1 2σ 2 (log λ i log ˆλ i ) 1 2σ 2 (λ i ˆλ i ) 1 2 (λ i ˆλ i ) (log λ i log ˆλ i )
2 Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes F Convergence of hyperparaeters fro optiising lower bound to optial hyperparaeters Note fro Figure 7 that the hyperparaeters found by optiising the lower bound converges to the hyperparaeters found by the GP when optiising the arginal likelihood, giving epirical evidence for the second clai in Section 3.3. G Parallelising SKC Note that SKC can be parallelised across the rando hyperparaeter initialisations, and also across the kernels at each depth for coputing the BIC intervals. In fact, SKC is even ore parallelisable with the kernel buffer: say at a certain depth, we have two kernels reaining to be optiised and evaluated before we can ove onto the next depth. If the buffer size is 5, say, then we can in fact ove on to the next depth and grow the kernel search tree on the top 3 kernels of the buffer, without having to wait for the 2 kernel evaluations to be coplete. This saves a lot of coputation tie wasted by idle cores waiting for all kernel evaluations to finish before oving on to the next depth of the kernel search tree. H Optiisation Since we wish to use the learned kernels for interpretation, it is iportant to have the hyperparaeters lie in a sensible region after the optiisation. In other words, we wish to regularise the hyperparaeters during optiisation. For exaple, we want the SE kernel to learn a globally sooth function with local variation. When naïvely optiising the lower bound, soeties the length scale and the signal variance becoes very sall, so the SE kernel explains all the variation in the signal and ends up connecting the dots. We wish to avoid this type of behaviour. This can be achieved by giving priors to hyperparaeters and optiising the energy (log prior added to the log arginal likelihood) instead, as well as using sensible initialisations. Looking at Figure 8, we see that using a strong prior with a sensible rando initialisation (see Appendix I for details) gives a sensible soothly varying function, whereas for all the three other cases, we have the length scale and signal variance shrinking to sall values, causing the GP to overfit to the data. Note that the weak prior is the default prior used in the GPstuff software [39]. Careful initialisation of hyperparaeters and inducing points is also very iportant, and can have strong influence the resulting optia. It is sensible to have the optiised hyperparaeters of the parent kernel in the search tree be inherited and used to initialise the hyperparaeters of the child. The new hyperparaeters of the child ust be initialised with rando restarts, where the variance is sall enough to ensure that they lie in a sensible region, but large enough to explore a good portion of this region. As for the inducing points, we want to spread the out to capture both local and global structure. Trying both K-eans and a rando subset of training data, we conclude that they give siilar results and resort to a rando subset. Moreover we also have the option of learning the inducing points. However, this will be considerably ore costly and show little iproveent over fixing the, as we show in Section 4. Hence we do not learn the inducing points, but fix the to a given randoly chosen set. Hence for SKC, we use axiu a posteriori (MAP) estiates instead of MLE for the hyperparaeters to calculate the BIC, since the priors have a noticeable effect on the optiisation. This is justified [23] and has been used for exaple in [11], where they argue that using the MLE to estiate the BIC for Gaussian ixture odels can fail due to singularities and degeneracies. I Hyperparaeter initialisation and priors Z N (, 1), T N(σ 2, I) is a Gaussian with ean and variance σ 2 truncated at the interval I then renoralised. Signal noise σ 2 =.1 exp(z/2) p(log σ 2 ) = N (,.2) LIN σ 2 = exp(v ) where V T N(1, [, ]), l = exp( Z 2 ) p(log σ 2 ) = logunif p(log l) = logunif SE l = exp(z/2), σ 2 =.1 exp(z/2) p(log l) = N (,.1), p(log σ 2 ) = logunif PER p in = log(1 ax(x) in(x) N ) (shortest possible period is 1 tie steps) p ax = log( ax(x) in(x) 5 ) (longest possible period is a fifth of the range of data set) l = exp(z/2), p = exp(p in + W ) or exp(p ax + U), σ 2 1 =.1 exp(z/2) w.p. 2 where W T N (.5, [, ]),
3 Hyunjik Ki and Yee Whye Teh Figure 7: Log arginal likelihood and hyperparaeter values after optiising the lower bound with ARD kernel on a subset of the Power plant data for different values of. This is copared against the GP values when optiising the true log arginal likelihood. Error bars show ean ± 1 standard deviation over 1 rando iterations W/ data strong prior, rand init weak prior, rand init strong prior, sall init weak prior, sall init year Figure 8: GP predictions on solar data set with SE kernel for different priors and initialisations. U T N (.5, [, ])) p(log l) = t(µ =, σ 2 = 1, ν = 4), p(log p) = LN (p in.5,.25) or LN (p ax 2,.5) 1 w.p. 2 p(log σ 2 ) = logunif where LN (µ, σ 2 ) is log Gaussian, t(µ, σ 2, ν) is the student s t-distribution. J Coputation ties Look at Table 2. For all three data sets, the GP optiisation tie is uch greater than the su of the Var GP optiisation tie and the upper bound (NLD + NIP) evaluation tie for 8. Hence the savings in coputation tie for SKC is significant even for these sall data sets. Note that we show the lower bound optiisation tie against the upper bound evaluation tie instead of the evaluation ties for both, since this is what happens in SKC - the lower bound has to be optiised for each kernel, whereas the upper bound only has to be evaluated once. K Mauna and Solar plots and hyperparaeter values found by SKC Solar The solar data has 26 cycles over 285 years, which gives a periodicity of around years. Using SKC with = 4, we find the kernel: SE PER SE SE PER. The value of the period hyperparaeter in PER is years, hence SKC finds the periodicity to 3 s.f. with only 4 inducing points. The SE ter converts the global periodicity to local periodicity, with the extent of the locality governed by the length scale paraeter in SE, equal to 45. This is fairly large, but saller than the range of the doain ( ), indicating that the periodicity spans over a long tie but isn t quite global. This is ost likely due to the static solar irradiance between the years , adding a bit of noise to the periodicities. Mauna The annual periodicity in the data and the linear trend with positive slope is clear. Linear regression gives us a slope of SKC with = 4 gives the kernel: SE + PER + LIN. The period hyperparaeter in PER takes value 1, hence SKC successfully finds the right periodicity. The offset l and agnitude σ 2 paraeters of LIN allow us to calculate the slope
4 Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes Table 2: Mean and standard deviation of coputation ties (in seconds) for full GP optiisation, Var GP optiisation(with and without learning inducing points), NLD and NIP (PCG using PIC preconditioner) upper bounds over 1 rando iterations. Solar Mauna Concrete GP ± ± ± Var GP = ± ± ±.7298 = ± ± ± = ± ± ± = ± ± ± = ± ± ± = ± ± ± NLD =1.19 ±.1.33 ± ±.4 =2.26 ±.1.46 ± ±.7 =4.43 ±.1.79 ± ±.5 =8.84 ± ± ±.12 = ± ± ±.3 = ± ± ±.74 NIP =1.474 ± ± ±.26 =2.422 ± ± ±.45 =4.284 ± ± ±.483 =8.199 ± ± ±.376 =16.26 ± ± ±.422 =32.25 ± ± ±.433 Var GP, = ± ± ± 32.5 learn IP = ± ± ± 97. = ± ± ± = ± ± ± 41. = ± ± ± 46.9 = ± ± ± 82.9 by the forula σ 2 (x l) [σ 2 (x l)(x l) + σ 2 ni] 1 y where σ 2 n is the noise variance in the learned likelihood. This forula is obtained fro the posterior ean of the GP, which is linear in the inputs for the linear kernel. This value aounts to 1.515, hence the slope found by SKC is accurate to 3 s.f. L Optiising the upper bound If the upper bound is tighter and ore robust with respect to choice of inducing points, why don t we optiise the upper bound to find hyperparaeters? If this were to be possible then we can axiise this to get an upper bound of the arginal likelihood with optial hyperparaeters. In fact this holds for any analytic upper bound whose value and gradients can be evaluated cheaply. Hence for any, we can find an interval that contains the true optiised arginal likelihood. So if this interval is doinated by an interval of another kernel, we can discard the kernel and there is no need to evaluate the bounds for bigger values of. Now we wish to use values of such that we can choose the right kernel (or kernels) at each depth of the search tree with inial coputation. This gives rise to an exploitation-exploration trade-off, whereby we want to balance between raising for tight intervals that allow us to discard unsuitable kernels whose intervals fall strictly below that of other kernels, and quickly oving on to the next depth in the search tree to search for finer structure in the data. The search algorith is highly parallelisable, and thus we ay raise siultaneously for all candidate kernels. At deeper levels of the search tree, there ay be too any candidates for siultaneous coputation, in which case we ay select the ones with the highest upper bound to get tighter intervals. Such attepts are listed below. Note the two inequalities for the NLD and NIP ters: 1 2 log det( ˆK + σ 2 I) 1 Tr(K ˆK) 2σ2 1 2 log det(k + σ2 I) 1 2 y ( ˆK + σ 2 I) 1 y 1 2 y (K+σ 2 I) 1 y 1 2 log det( ˆK + σ 2 I) (5) 1 2 y (K + (σ 2 + Tr(K ˆK))I) 1 y (6) Where the first two inequalities coe fro [3], the
5 Hyunjik Ki and Yee Whye Teh solar irradiance to get the upper bound 1 2 log det( ˆK +σ 2 I) 1 2 y (K +(σ 2 +Tr(K ˆK))I) 1 y However we found that axiising this bound gives quite a loose upper bound unless = O(N). Hence this upper bound is not very useful M Rando Fourier Features CO2 level year (a) Solar year (b) Mauna Figure 9: Plots of sall tie series data: Solar and Mauna Rando Fourier Features (RFF) (a.k.a. Rando Kitchen Sinks) was introduced by [26] as a low rank approxiation to the kernel atrix. It uses the following theore Theore 3 (Bochner s Theore [29]). A stationary kernel k(d) is positive definite if and only if k(d) is the Fourier transfor of a non-negative easure. to give an unbiased low-rank approxiation to the Gra atrix K = E[Φ Φ] with Φ R N. A bigger lowers the variance of the estiate. Using this approxiation, one can copute deterinants and inverses in O(N 2 ) tie. In the context of kernel coposition in 2, RFFs have the nice property that saples fro the spectral density of the su or product of kernels can easily be obtained as sus or ixtures of saples of the individual kernels (see Appendix N). We use this later to give a eory-efficient upper bound on the log arginal likelihood in Appendix P. third inequality is a direct consequence of K ˆK being postive sei-definite, and the last inequality is fro Michalis Titsias lecture slides 3. Also fro 4, we have that 1 2 log det( ˆK + σ 2 I) α (K + σ 2 I)α α y is an upper bound α R N. Thus one idea of obtaining a cheap upper bound to the optiised arginal likelihood was to solve the following axiin optiisation proble: ax θ in 1 α R N 2 log det( ˆK+σ 2 I)+ 1 2 α (K+σ 2 I)α α y One way to solve this cheaply would be by coordinate descent, where one axiises with respect to θ fixing α, then iniises with respect to α fixing θ. However σ tends to blow up in practice. This is because the expression is O( log σ 2 + σ 2 ) for fixed α, hence axiising with respect to σ pushes it towards infinity. N Rando Features for Sus and Products of Kernels For RFF the kernel can be approxiated by the inner product of rando features given by saples fro its spectral density, in a Monte Carlo approxiation, as follows: k(x y) = e iv (x y) dp(v) R D p(v)e iv (x y) dv R D where φ(x) = = E p(v) [e iv x (e iv y ) ] = E p(v) [Re(e iv x (e iv y ) )] 1 Re(e iv x k (e iv y k ) ) k=1 = E b,v [φ(x) φ(y)] 2 (cos(v 1 x+b 1 ),..., cos(v x+b )) An alternative is to su the two upper bounds above with spectral frequencies v k iid saples fro p(v) and b k iid saples fro U[, 2π]. 3 Let k 1, k 2 be two stationary kernels, with respective
6 Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes spectral densities p 1, p 2 so that k 1 (d) = a 1 ˆp 1 (d), k 2 (d) = a 2 ˆp 2 (d), where ˆp(d) := p(v)e iv d dv. We use this convention as the Fourier R D transfor. Note a i = k i (). (k 1 + k 2 )(d) = a 1 p 1 (v)e iv d dv + a 2 p 2 (v)e iv d dv = (a 1 + a 2 ) p ˆ + (d) where p + (v) = a1 a 1+a 2 p 1 (v) + a2 a 1+a 2 p 2 (v), a ixture of p 1 and p 2. So to generate RFF for k 1 + k 2, generate v p + by generating v p 1 w.p. 1 a a 1+a 2 and v a p 2 w.p. 2 a 1+a 2 Now for the product, suppose (k 1 k 2 )(d) = a 1 a 2 ˆp 1 (d) ˆp 2 (d) = a 1 a 2 ˆp (d) Then p (d) is the inverse fourier transfor of ˆp 1 ˆp 2, which is the convolution p 1 p 2 (d) := p R D 1 (z)p 2 (d z)dz. So to generate RFF for k 1 k 2, generate v p by generating v 1 p 1, v 2 p 2 and setting v = v 1 + v 2. This is not applicable for non-stationary kernels, such as the linear kernel. We deal with this proble as follows: Suppose φ 1, φ 2 are rando features such that k 1 (x, x ) = φ 1 (x) φ 1 (x ), φ 2 (x) φ 2 (x ), φ i : R D R. It is straightforward to verify that P An upper bound to NLD using Rando Fourier Features Note that the function f(x) = log det(x) is convex on the set of positive definite atrices [6]. Hence by Jensen s inequality we have, for Φ Φ an unbiased estiate of K: 1 2 log det(k + σ2 I) = f(k + σ 2 I) = f(e[φ Φ + σ 2 I]) E[f(Φ Φ + σ 2 I)] Hence 1 2 log det(φ Φ + σ 2 I) is a stochastic upper bound to NLD that can be calculated in O(N 2 ). An exaple of such an unbiased estiator Φ is given by RFF. Q Further Plots (k 1 + k 2 )(x, x ) = φ + (x) φ + (x ) (k 1 k 2 )(x, x ) = φ (x) φ (x ) where φ + ( ) = (φ 1 ( ), φ 2 ( ) ) and φ ( ) = φ 1 ( ) φ 2 ( ). However we do not want the nuber of features to grow as we add or ultiply kernels, since it will grow exponentially. We want to keep it to be features. So we subsaple entries fro φ + (or φ ) and scale by factor 2 ( for φ ), which will still give us unbiased estiates of the kernel provided that each ter of the inner product φ + (x) φ + (x ) (or φ (x) φ (x )) is an unbiased estiate of (k 1 + k 2 )(x, x )(or (k 1 k 2 )(x, x )). This is how we generate rando features for linear kernels cobined with other stationary kernels, using the features φ(x) = σ (x,..., x). O Spectral Density for PER Fro [35], we have that the spectral density of the PER kernel is: I n (l 2 ( ) exp(l 2 ) δ v 2πn ) p n= where I is the odified Bessel function of the first kind.
7 Hyunjik Ki and Yee Whye Teh negative log ML energy fullgp Var LB UB NLD Nystro NLD UB RFF NLD UB NIP (a) Mauna: fix inducing points CG PCG Nys PCG FIC PCG PIC negative log energy ML fullgp Var LB UB NLD Nystro NLD UB RFF NLD UB NIP (b) Mauna: learn inducing points Figure 1: Sae as 1a and 1b but for Mauna Loa data. CG PCG Nys PCG FIC PCG PIC negative log ML energy -6-8 fullgp Var LB UB NLD Nystro NLD UB RFF NLD UB NIP (a) Concrete: fix inducing points CG PCG Nys PCG FIC PCG PIC negative log ML energy fullgp Var LB UB NLD Nystro NLD UB 2 RFF NLD UB NIP CG PCG Nys -55 PCG FIC PCG PIC 1-6 (b) Concrete: learn inducing points Figure 11: Sae as 1a and 1b but for Concrete data. SE PER LIN (SE)*SE (SE)*PER (SE)+SE (PER)+SE (PER)+PER (PER)*PER (PER)+LIN (LIN)+LIN (LIN)+SE (PER)*LIN (LIN)*LIN (SE)*LIN Figure 12: Kernel search tree for SKC on solar data up to depth 2. We show the upper and lower bounds for different nubers of inducing points.
8 Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes Figure 13: Sae as Figure 12 but for Mauna data. Figure 14: Sae as Figure 12 but for concrete data and up to depth 1.
arxiv: v2 [stat.ml] 14 Feb 2018 Abstract
Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes arxiv:176.2524v2 [stat.ml] 14 Feb 218 Abstract Autoating statistical odelling is a challenging proble in artificial
More informationMachine Learning Basics: Estimators, Bias and Variance
Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics
More information1 Bounding the Margin
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost
More informationCOS 424: Interacting with Data. Written Exercises
COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well
More informationCS Lecture 13. More Maximum Likelihood
CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood
More informationScaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes
Scaling up the Autoatic Statistician: Scalable Structure Discovery using Gaussian Processes Hyunjik Ki and Yee Whye Teh University of Oxford, DeepMind {hki,y.w.teh}@stats.ox.ac.uk Abstract Autoating statistical
More informationFeature Extraction Techniques
Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that
More information13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices
CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay
More informationQuantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search
Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths
More informationBlock designs and statistics
Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent
More informationDetection and Estimation Theory
ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer
More informationKernel Methods and Support Vector Machines
Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic
More informationE0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis
E0 370 tatistical Learning Theory Lecture 6 (Aug 30, 20) Margin Analysis Lecturer: hivani Agarwal cribe: Narasihan R Introduction In the last few lectures we have seen how to obtain high confidence bounds
More informationUsing EM To Estimate A Probablity Density With A Mixture Of Gaussians
Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points
More informationPseudo-marginal Metropolis-Hastings: a simple explanation and (partial) review of theory
Pseudo-arginal Metropolis-Hastings: a siple explanation and (partial) review of theory Chris Sherlock Motivation Iagine a stochastic process V which arises fro soe distribution with density p(v θ ). Iagine
More informationCh 12: Variations on Backpropagation
Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith
More informationBootstrapping Dependent Data
Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly
More informationPAC-Bayes Analysis Of Maximum Entropy Learning
PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E
More informationProbability Distributions
Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples
More informationNon-Parametric Non-Line-of-Sight Identification 1
Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,
More informationSharp Time Data Tradeoffs for Linear Inverse Problems
Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used
More informationIntelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines
Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes
More informationA Theoretical Analysis of a Warm Start Technique
A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful
More informationA Note on the Applied Use of MDL Approximations
A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention
More informationCombining Classifiers
Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/
More informationList Scheduling and LPT Oliver Braun (09/05/2017)
List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)
More informationA Simple Regression Problem
A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where
More informationModel Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon
Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential
More informationSoft-margin SVM can address linearly separable problems with outliers
Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly
More information1 Generalization bounds based on Rademacher complexity
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges
More informationStochastic Subgradient Methods
Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods
More informationBoosting with log-loss
Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the
More informationTopic 5a Introduction to Curve Fitting & Linear Regression
/7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline
More informationSupport Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab
Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a
More informationComputable Shell Decomposition Bounds
Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October
More informationBayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)
Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu
More information3.3 Variational Characterization of Singular Values
3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and
More informationTesting equality of variances for multiple univariate normal populations
University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Inforation Sciences 0 esting equality of variances for ultiple univariate
More informationCSE525: Randomized Algorithms and Probabilistic Analysis May 16, Lecture 13
CSE55: Randoied Algoriths and obabilistic Analysis May 6, Lecture Lecturer: Anna Karlin Scribe: Noah Siegel, Jonathan Shi Rando walks and Markov chains This lecture discusses Markov chains, which capture
More informationHomework 3 Solutions CSE 101 Summer 2017
Hoework 3 Solutions CSE 0 Suer 207. Scheduling algoriths The following n = 2 jobs with given processing ties have to be scheduled on = 3 parallel and identical processors with the objective of iniizing
More information1 Proof of learning bounds
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a
More informationChapter 6 1-D Continuous Groups
Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:
More informationSupport Vector Machines. Maximizing the Margin
Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the
More informationChaotic Coupled Map Lattices
Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each
More informationSupport Vector Machines MIT Course Notes Cynthia Rudin
Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance
More informationOn Constant Power Water-filling
On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives
More informationComputational and Statistical Learning Theory
Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic
More informationIn this chapter, we consider several graph-theoretic and probabilistic models
THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions
More informationThe proofs of Theorem 1-3 are along the lines of Wied and Galeano (2013).
A Appendix: Proofs The proofs of Theore 1-3 are along the lines of Wied and Galeano (2013) Proof of Theore 1 Let D[d 1, d 2 ] be the space of càdlàg functions on the interval [d 1, d 2 ] equipped with
More informationThe Weierstrass Approximation Theorem
36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined
More informationLecture 21. Interior Point Methods Setup and Algorithm
Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and
More informationPattern Recognition and Machine Learning. Artificial Neural networks
Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3
More informationEE5900 Spring Lecture 4 IC interconnect modeling methods Zhuo Feng
EE59 Spring Parallel LSI AD Algoriths Lecture I interconnect odeling ethods Zhuo Feng. Z. Feng MTU EE59 So far we ve considered only tie doain analyses We ll soon see that it is soeties preferable to odel
More informationComputational and Statistical Learning Theory
Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher
More informationComputable Shell Decomposition Bounds
Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago
More informationlecture 36: Linear Multistep Mehods: Zero Stability
95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,
More informationPHY 171. Lecture 14. (February 16, 2012)
PHY 171 Lecture 14 (February 16, 212) In the last lecture, we looked at a quantitative connection between acroscopic and icroscopic quantities by deriving an expression for pressure based on the assuptions
More informationAnalyzing Simulation Results
Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient
More informationThis model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.
CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when
More informationLower Bounds for Quantized Matrix Completion
Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &
More informationTracking using CONDENSATION: Conditional Density Propagation
Tracking using CONDENSATION: Conditional Density Propagation Goal Model-based visual tracking in dense clutter at near video frae rates M. Isard and A. Blake, CONDENSATION Conditional density propagation
More informationWhen Short Runs Beat Long Runs
When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long
More informationA remark on a success rate model for DPA and CPA
A reark on a success rate odel for DPA and CPA A. Wieers, BSI Version 0.5 andreas.wieers@bsi.bund.de Septeber 5, 2018 Abstract The success rate is the ost coon evaluation etric for easuring the perforance
More informationSupport Vector Machines. Goals for the lecture
Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed
More informationIntelligent Systems: Reasoning and Recognition. Artificial Neural Networks
Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial
More informationProc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES
Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co
More informationLecture October 23. Scribes: Ruixin Qiang and Alana Shine
CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single
More informationSPECTRUM sensing is a core concept of cognitive radio
World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile
More informationSupport Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization
Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering
More informationC na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction
TWO-STGE SMPLE DESIGN WITH SMLL CLUSTERS Robert G. Clark and David G. Steel School of Matheatics and pplied Statistics, University of Wollongong, NSW 5 ustralia. (robert.clark@abs.gov.au) Key Words: saple
More information2 Q 10. Likewise, in case of multiple particles, the corresponding density in 2 must be averaged over all
Lecture 6 Introduction to kinetic theory of plasa waves Introduction to kinetic theory So far we have been odeling plasa dynaics using fluid equations. The assuption has been that the pressure can be either
More informationPrincipal Components Analysis
Principal Coponents Analysis Cheng Li, Bingyu Wang Noveber 3, 204 What s PCA Principal coponent analysis (PCA) is a statistical procedure that uses an orthogonal transforation to convert a set of observations
More informationA Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)
1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu
More informationDERIVING PROPER UNIFORM PRIORS FOR REGRESSION COEFFICIENTS
DERIVING PROPER UNIFORM PRIORS FOR REGRESSION COEFFICIENTS N. van Erp and P. van Gelder Structural Hydraulic and Probabilistic Design, TU Delft Delft, The Netherlands Abstract. In probles of odel coparison
More informationarxiv: v1 [cs.ds] 3 Feb 2014
arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/
More informationKeywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution
Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality
More informationExperimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis
City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna
More informationPattern Recognition and Machine Learning. Artificial Neural networks
Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial
More informationThis article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and
This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution
More informatione-companion ONLY AVAILABLE IN ELECTRONIC FORM
OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer
More informationA Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness
A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,
More informationPh 20.3 Numerical Solution of Ordinary Differential Equations
Ph 20.3 Nuerical Solution of Ordinary Differential Equations Due: Week 5 -v20170314- This Assignent So far, your assignents have tried to failiarize you with the hardware and software in the Physics Coputing
More informationTEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES
TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES S. E. Ahed, R. J. Tokins and A. I. Volodin Departent of Matheatics and Statistics University of Regina Regina,
More informationma x = -bv x + F rod.
Notes on Dynaical Systes Dynaics is the study of change. The priary ingredients of a dynaical syste are its state and its rule of change (also soeties called the dynaic). Dynaical systes can be continuous
More information3.8 Three Types of Convergence
3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to
More informationOcean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers
Ocean 40 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers 1. Hydrostatic Balance a) Set all of the levels on one of the coluns to the lowest possible density.
More informationBiostatistics Department Technical Report
Biostatistics Departent Technical Report BST006-00 Estiation of Prevalence by Pool Screening With Equal Sized Pools and a egative Binoial Sapling Model Charles R. Katholi, Ph.D. Eeritus Professor Departent
More informationTraining an RBM: Contrastive Divergence. Sargur N. Srihari
Training an RBM: Contrastive Divergence Sargur N. srihari@cedar.buffalo.edu Topics in Partition Function Definition of Partition Function 1. The log-likelihood gradient 2. Stochastic axiu likelihood and
More informationLecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon
Lecture 2: Enseble Methods Isabelle Guyon guyoni@inf.ethz.ch Introduction Book Chapter 7 Weighted Majority Mixture of Experts/Coittee Assue K experts f, f 2, f K (base learners) x f (x) Each expert akes
More informationSolutions of some selected problems of Homework 4
Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next
More informationAutomated Frequency Domain Decomposition for Operational Modal Analysis
Autoated Frequency Doain Decoposition for Operational Modal Analysis Rune Brincker Departent of Civil Engineering, University of Aalborg, Sohngaardsholsvej 57, DK-9000 Aalborg, Denark Palle Andersen Structural
More informationConsistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material
Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable
More informationGraphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes
Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109
More informationa a a a a a a m a b a b
Algebra / Trig Final Exa Study Guide (Fall Seester) Moncada/Dunphy Inforation About the Final Exa The final exa is cuulative, covering Appendix A (A.1-A.5) and Chapter 1. All probles will be ultiple choice
More informationBayes Decision Rule and Naïve Bayes Classifier
Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.
More informationA note on the multiplication of sparse matrices
Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani
More informationOBJECTIVES INTRODUCTION
M7 Chapter 3 Section 1 OBJECTIVES Suarize data using easures of central tendency, such as the ean, edian, ode, and idrange. Describe data using the easures of variation, such as the range, variance, and
More informationAn improved self-adaptive harmony search algorithm for joint replenishment problems
An iproved self-adaptive harony search algorith for joint replenishent probles Lin Wang School of Manageent, Huazhong University of Science & Technology zhoulearner@gail.co Xiaojian Zhou School of Manageent,
More informationPattern Recognition and Machine Learning. Artificial Neural networks
Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial
More information