Student-t Processes as Alternatives to Gaussian Processes
A. Shah, A. G. Wilson and Z. Ghahramani
Discussion by: R. Henao, Duke University
June 20, 2014
Contributions

The paper is concerned with the following aspects of Student-t processes:
- Definition and motivation of inverse Wishart processes (IWP).
- Proposal of a Student-t process (TP) derived from a hierarchical GP model.
- Showing that the predictive covariance of a TP depends on the training observations.
- Showing that the TP is the most general elliptically symmetric process with analytic marginal and predictive distributions.
- Derivation of a new sampling strategy for the IWP.
- An analytic TP noise model that can separate signal from noise.
- Empirical demonstration of non-trivial differences between GPs and TPs.
Inverse Wishart Distribution

Definition (Dawid, 1981). Σ ∈ Π(n) has an inverse Wishart distribution with parameters ν ∈ ℝ₊ and K ∈ Π(n), written Σ ~ IW_n(ν, K), if its density is

  p(Σ) = c_n(ν, K) |Σ|^{-(ν+2n)/2} exp( -½ trace(K Σ^{-1}) ),

with

  c_n(ν, K) = |K|^{(ν+n-1)/2} / ( 2^{(ν+n-1)n/2} Γ_n((ν+n-1)/2) ).

Some properties:
- When ν > 2, E[Σ] = (ν-2)^{-1} K.
- Wishart and inverse Wishart distributions place prior mass on every Σ ∈ Π(n).
- Σ ~ W_n(ν, K) iff Σ^{-1} ~ IW_n(ν - n + 1, K^{-1}).
- Dawid (1981) showed that the inverse Wishart distribution is closed under marginalization: if Σ ~ IW_n(ν, K) then Σ_{11} ~ IW_{n₁}(ν, K_{11}).
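Dawid's parameterization differs from the one SciPy uses; a minimal sketch of the correspondence, assuming SciPy's `invwishart` exponent convention −(df+n+1)/2, which would make Dawid's IW_n(ν, K) correspond to df = ν + n − 1:

```python
import numpy as np
from scipy.stats import invwishart

# Dawid's IW_n(nu, K) has density proportional to
#   |Sigma|^{-(nu+2n)/2} exp(-trace(K Sigma^{-1}) / 2),
# while SciPy's invwishart(df, scale) uses exponent -(df+n+1)/2,
# so df = nu + n - 1 (an assumed mapping; re-derive before relying on it).
nu, n = 5.0, 3
K = np.eye(n) + 0.3 * np.ones((n, n))
mean = invwishart.mean(df=nu + n - 1, scale=K)

# For nu > 2 the slide gives E[Sigma] = (nu - 2)^{-1} K.
print(np.allclose(mean, K / (nu - 2)))  # True
```

If the two means agree, the df offset is consistent with the property E[Σ] = (ν−2)^{-1} K stated above.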
Inverse Wishart Process

Definition. σ is an inverse Wishart process on X with parameters ν ∈ ℝ₊ and base kernel k_θ : X × X → ℝ if for any finite collection x₁, ..., x_n ∈ X,

  σ(x₁, ..., x_n) ~ IW_n(ν, K),   K ∈ Π(n) with K_{ij} = k_θ(x_i, x_j).

We write σ ~ IWP(ν, k_θ).

Generative model:

  σ ~ IWP(ν, k_θ),
  y | σ ~ GP(φ, (ν-2)σ),

where φ : X → ℝ. For data y = [y₁ ... y_n]^T with φ = [φ(x₁) ... φ(x_n)]^T and Σ = σ(x₁, ..., x_n),

  p(y | φ, ν, K) = ∫ p(y | φ, Σ) p(Σ | ν, K) dΣ ∝ ( 1 + (y-φ)^T K^{-1} (y-φ) / (ν-2) )^{-(ν+n)/2}.
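A Monte Carlo sanity check of the hierarchy above (reusing the assumed df = ν + n − 1 mapping onto SciPy's `invwishart`; sample sizes and K are illustrative choices): drawing Σ from the prior and then y | Σ from the Gaussian should give marginal covariance (ν−2) E[Σ] = K.

```python
import numpy as np
from scipy.stats import invwishart

# Hierarchy: Sigma ~ IW_n(nu, K), y | Sigma ~ N(0, (nu-2) Sigma).
# Marginally cov[y] = (nu - 2) E[Sigma] = K.
rng = np.random.default_rng(0)
nu, n, N = 6.0, 2, 100_000
K = np.array([[1.0, 0.4], [0.4, 1.0]])

Sigmas = invwishart.rvs(df=nu + n - 1, scale=K, size=N, random_state=rng)
L = np.linalg.cholesky((nu - 2) * Sigmas)          # stacked Cholesky factors
y = (L @ rng.standard_normal((N, n, 1)))[..., 0]   # one y draw per Sigma draw

print(np.allclose(np.cov(y.T), K, atol=0.08))
```

The tolerance is loose because the marginal of y is heavy-tailed, so the sample covariance converges slowly.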
Student-t Process

Definition. y ∈ ℝ^n has a multivariate Student-t distribution with parameters ν ∈ ℝ₊ \ [0,2], φ ∈ ℝ^n and K ∈ Π(n) if it has density

  p(y) = Γ((ν+n)/2) / ( (ν-2)^{n/2} π^{n/2} Γ(ν/2) |K|^{1/2} ) · ( 1 + (y-φ)^T K^{-1} (y-φ) / (ν-2) )^{-(ν+n)/2}.

We write y ~ MVT_n(ν, φ, K).

Some properties:
- E[y] = E[ E[y | Σ] ] = φ.
- cov[y] = E[ E[(y-φ)(y-φ)^T | Σ] ] = E[(ν-2)Σ] = K.

Lemma. The MVT distribution is closed under marginalization.

Definition. f is a Student-t process on X with parameters ν > 2, mean function φ : X → ℝ and kernel function k_θ : X × X → ℝ if any finite collection of function values satisfies [f(x₁) ... f(x_n)]^T ~ MVT_n(ν, φ, K), where K ∈ Π(n) with K_{ij} = k_θ(x_i, x_j) and φ ∈ ℝ^n with φ_i = φ(x_i). We write f ~ TP(ν, φ, k_θ).
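The density above can be cross-checked against SciPy's standard multivariate t, assuming the mapping that MVT_n(ν, φ, K) in this covariance parameterization equals a standard multivariate t with df = ν and shape matrix (ν−2)K/ν (the test values are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_t
from scipy.special import gammaln

# Log-density of MVT_n(nu, phi, K) in the slides' parameterization,
# where K is the covariance matrix (requires nu > 2).
def mvt_logpdf(y, nu, phi, K):
    n = len(y)
    dev = y - phi
    beta = dev @ np.linalg.solve(K, dev)   # (y-phi)^T K^{-1} (y-phi)
    return (gammaln((nu + n) / 2) - gammaln(nu / 2)
            - (n / 2) * np.log((nu - 2) * np.pi)
            - 0.5 * np.linalg.slogdet(K)[1]
            - ((nu + n) / 2) * np.log1p(beta / (nu - 2)))

nu, phi = 5.0, np.zeros(3)
K = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
y = np.array([0.3, -1.2, 0.7])

# Assumed mapping: shape = (nu - 2) K / nu recovers the standard Student-t form.
ref = multivariate_t(loc=phi, shape=(nu - 2) * K / nu, df=nu).logpdf(y)
print(np.isclose(mvt_logpdf(y, nu, phi, K), ref))  # True
```

The (ν−2) factors in the normalizer and the quadratic form are exactly what make K the covariance rather than the usual scale matrix.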
Relation to Gaussian Processes

The GP is a special case of the TP.

Lemma. Suppose f ~ TP(ν, φ, k_θ) and g ~ GP(φ, k_θ). Then f tends to g in distribution as ν → ∞.

The parameter ν controls the tails of the distribution.
Conditional Distribution

Lemma. Suppose y ~ MVT_n(ν, φ, K) and let y = [y₁ y₂]^T with y₁ ∈ ℝ^{n₁} and y₂ ∈ ℝ^{n₂}. Then

  y₂ | y₁ ~ MVT_{n₂}( ν + n₁, φ̃₂, ((ν + β₁ - 2)/(ν + n₁ - 2)) K̃₂₂ ),

where
- φ̃₂ = K₂₁ K₁₁^{-1} (y₁ - φ₁) + φ₂,
- β₁ = (y₁ - φ₁)^T K₁₁^{-1} (y₁ - φ₁),
- K̃₂₂ = K₂₂ - K₂₁ K₁₁^{-1} K₁₂.

Hence
- E[y₂ | y₁] = φ̃₂,
- cov[y₂ | y₁] = ((ν + β₁ - 2)/(ν + n₁ - 2)) K̃₂₂.

The predictive covariance of y₂ depends on y₁ through β₁.
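The lemma translates directly into code; a sketch (the helper name and toy values are illustrative choices, not from the paper):

```python
import numpy as np

# Transcription of the lemma:
#   y2 | y1 ~ MVT_{n2}(nu + n1, phi2_t, ((nu + beta1 - 2)/(nu + n1 - 2)) K22_t)
def tp_conditional(y1, nu, phi1, phi2, K11, K12, K22):
    n1 = len(y1)
    a = np.linalg.solve(K11, y1 - phi1)                  # K11^{-1} (y1 - phi1)
    phi2_t = K12.T @ a + phi2                            # conditional mean
    beta1 = (y1 - phi1) @ a                              # quadratic form in y1
    K22_t = K22 - K12.T @ np.linalg.solve(K11, K12)      # Schur complement
    scale = (nu + beta1 - 2) / (nu + n1 - 2)             # data-dependent factor
    return nu + n1, phi2_t, scale * K22_t

# Toy 1-D check: K11 = K22 = 1, K12 = 0.5, phi = 0, y1 = 2 gives beta1 = 4,
# so the covariance is the GP posterior variance 0.75 times (nu+2)/(nu-1).
df, m, C = tp_conditional(np.array([2.0]), 5.0, np.zeros(1), np.zeros(1),
                          np.eye(1), np.array([[0.5]]), np.eye(1))
print(df, m, C)  # 6.0 [1.] [[1.3125]]
```

Note the mean and Schur complement coincide with the GP conditional; only the β₁-dependent scale factor differs.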
Another Covariance Prior

Yu et al., 2007: scale mixture of Gaussians construction

  r^{-1} ~ Gamma(ν/2, ρ/2),
  y | r ~ N(φ, r(ν-2)K/ρ),

where K ∈ Π(n), φ ∈ ℝ^n, ν > 2 and ρ > 0; marginally, y ~ MVT_n(ν, φ, K). Moreover,

  r^{-1} | y ~ Gamma( (ν+n)/2, (ρ/2)(1 + β₁/(ν-2)) ),

hence E[ r(ν-2)/ρ | y ] = (ν + β₁ - 2)/(ν + n - 2), the same data-dependent factor that scales the TP predictive covariance.
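The posterior-mean identity follows from the inverse-gamma mean: if r^{-1} | y ~ Gamma(shape a, rate b), then E[r | y] = b/(a−1) for a > 1. A quick numeric check with arbitrary test values:

```python
import numpy as np

# If r^{-1} | y ~ Gamma(shape=a, rate=b) then r | y is inverse-gamma
# with E[r | y] = b / (a - 1) for a > 1.
nu, n, rho, beta1 = 5.0, 4, 2.0, 3.7   # arbitrary illustrative values
a = (nu + n) / 2
b = (rho / 2) * (1 + beta1 / (nu - 2))
lhs = (nu - 2) / rho * b / (a - 1)     # E[ r (nu-2)/rho | y ]
rhs = (nu + beta1 - 2) / (nu + n - 2)  # claimed closed form
print(np.isclose(lhs, rhs))  # True
```

Expanding b/(a−1) symbolically gives (ν−2+β₁)/(ν+n−2), so the equality is exact, not numeric coincidence.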
Elliptical Processes

Definition. y ∈ ℝ^n is elliptically symmetric iff there exist μ ∈ ℝ^n, a non-negative random variable R, an n × d matrix Ω of maximal rank d, and u uniformly distributed on the unit sphere in ℝ^d, independent of R, such that

  y =_D μ + R Ω u.

Lemma (Fang et al., 1989). Suppose R₁ ~ χ²(n) and R₂ ~ Gamma^{-1}(ν/2, 1/2) independently.
- If R = √R₁, then y is Gaussian distributed.
- If R = √((ν-2) R₁ R₂), then y is MVT distributed.

Definition. Let Y = {y_i} be a countable family of random variables. Y is an elliptical process if any finite subset of them is jointly elliptically symmetric.

Theorem (Kelker, 1970). Suppose Y = {y_i} is an elliptical process. Any finite collection z = {z₁, ..., z_n} ⊂ Y has a density iff there exists a non-negative random variable r such that z | r ~ N(μ, r Ω Ωᵀ).

Corollary. Suppose Y = {y_i} is an elliptical process. Any finite collection z = {z₁, ..., z_n} ⊂ Y has an analytically representable density iff Y is a Gaussian process or a Student-t process.
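The stochastic representation y = μ + R Ω u can be sampled directly; a sketch with μ = 0 and an arbitrary Ω, checking that both radial laws in the lemma reproduce covariance Ω Ωᵀ:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, nu = 2, 200_000, 8.0
Omega = np.array([[1.0, 0.0], [0.4, 0.8]])   # illustrative choice

z = rng.standard_normal((N, d))
u = z / np.linalg.norm(z, axis=1, keepdims=True)   # uniform on the unit sphere
R1 = rng.chisquare(d, size=N)                      # R1 ~ chi^2(d)
R2 = 1.0 / rng.chisquare(nu, size=N)               # R2 ~ Gamma^{-1}(nu/2, 1/2)

y_gauss = (np.sqrt(R1)[:, None] * u) @ Omega.T               # Gaussian branch
y_mvt = (np.sqrt((nu - 2) * R1 * R2)[:, None] * u) @ Omega.T  # MVT branch

print(np.allclose(np.cov(y_gauss.T), Omega @ Omega.T, atol=0.05),
      np.allclose(np.cov(y_mvt.T), Omega @ Omega.T, atol=0.1))
```

The (ν−2) factor in the MVT branch is what makes the covariance equal Ω Ωᵀ exactly rather than the usual ν/(ν−2) inflation.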
A New Way to Sample the IWP

Theorem. Let Σ ∈ Π(n) with eigenvalues {λ₁, ..., λ_n}. There exists Q ∈ Ξ(n) such that Σ = Q Λ Qᵀ, where Λ = diag(λ₁, ..., λ_n).

Using the facts that Qᵀ Q = I and |AB| = |BA|,

  p(Σ) dΣ = p(Q Λ Qᵀ) |J(Σ; Q, Λ)| dΛ dQ
          ∝ [ ∏_{i=1}^n λ_i^{-(ν+2n)/2} e^{-1/(2λ_i)} ] ∏_{i<j} |λ_i - λ_j| ∏_i dλ_i dQ,

thus Q is uniformly distributed over Ξ(n) and the λ_i are exchangeable. We draw Σ = Q Λ Qᵀ (Dawid, 1977):
- Q ~ Υ_{n,n},
- Λ ~ Θ_n(ν).
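A sketch of the factorized sampler for K = I. The eigenvalue law Θ_n(ν) has no off-the-shelf sampler, so here the eigenvalues are taken from a direct inverse Wishart draw; by the theorem, Q and Λ are independent with Q Haar-uniform, so recombining fresh Haar Q with those eigenvalues is again an IW draw (the df = ν + n − 1 mapping onto SciPy is an assumption carried over from earlier):

```python
import numpy as np
from scipy.stats import invwishart, ortho_group

rng = np.random.default_rng(0)
nu, n = 5.0, 4

# Eigenvalues from a direct IW_n(nu, I) draw stand in for Lambda ~ Theta_n(nu).
lam = np.linalg.eigvalsh(invwishart.rvs(df=nu + n - 1, scale=np.eye(n),
                                        random_state=rng))
Q = ortho_group.rvs(n, random_state=rng)   # Haar-uniform orthogonal matrix
Sigma = Q @ np.diag(lam) @ Q.T             # Sigma = Q Lambda Q^T

# Sigma is symmetric with exactly the prescribed spectrum.
print(np.allclose(Sigma, Sigma.T), np.allclose(np.linalg.eigvalsh(Sigma), lam))
```

The point of the factorization in the paper is that Q and Λ can be updated separately in an MCMC scheme; this sketch only demonstrates the recombination step.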
Modeling Noisy Functions

Common practice for GPs:

  y = f + ε,   f ~ GP(φ, k_θ),   ε ~ N(0, ψI).

This model is tractable because Gaussian distributions are closed under addition. The MVT distribution is not closed under addition, but we can instead fold the noise into the kernel, k = k_θ + ψδ. Empirically,

  MVT_n(ν, 0, K) + MVT_n(ν, 0, ψI) ≈ MVT_n(ν, 0, K + ψI).
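Concretely, k = k_θ + ψδ just adds ψ to the diagonal of the train-train Gram matrix, since the Kronecker delta is nonzero only when the two inputs coincide. A small sketch with a squared exponential base kernel (ψ and the input grid are illustrative choices):

```python
import numpy as np

# Squared exponential base kernel k_theta on 1-D inputs.
def se_kernel(x1, x2, ell=1.0, sf=1.0):
    return sf**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / ell**2)

x = np.linspace(0.0, 1.0, 5)
psi = 0.1
K_noisy = se_kernel(x, x) + psi * np.eye(len(x))   # k = k_theta + psi * delta

# Diagonal gains psi; off-diagonal entries are untouched.
print(np.allclose(np.diag(K_noisy), 1.0 + psi))  # True
```

The TP marginal likelihood is then evaluated with K + ψI in place of K, which is what makes the noise model analytic.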
Experiments

Regression:
- Squared exponential kernel function plus delta.
- Sampling with Hamiltonian Monte Carlo for θ.
- 2 artificial and 3 benchmark datasets.
- Performance measures: MSE and log-likelihood.

Bayesian optimization:
- ARD Matérn kernel function plus delta.
- Sampling with slice sampling for θ.
- 3 benchmark functions.
- Performance measure: minimum function value vs. number of function evaluations.