Computational Statistics. Jian Pei School of Computing Science Simon Fraser University

1 Computational Statistics Jian Pei School of Computing Science Simon Fraser University

2 BASIC OPTIMIZATION METHODS J. Pei: Computational Statistics 2

3 Why Optimization?
In statistical inference, we often have to conduct maximum likelihood estimation.
Most functions cannot be optimized analytically.
Example: how do we maximize a given function $g(x)$? Set $g'(x) = 0$. The resulting equation often does not have an analytic solution.

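4 Problem Formulation
Let $l$ be the log likelihood function.
If $\hat{\theta}$ is the maximum likelihood estimate (MLE), then it is a solution to the score equation
$$l'(\theta) = 0,$$
where $l'(\theta) = \left(\frac{dl(\theta)}{d\theta_1}, \ldots, \frac{dl(\theta)}{d\theta_n}\right)^T$ and $0$ is a column vector of zeros.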

5 Example

6 The Bisection Method
Idea: if $g'$ is continuous on $[a_0, b_0]$ and $g'(a_0)\,g'(b_0) \le 0$, then there exists at least one $x^* \in [a_0, b_0]$ such that $g'(x^*) = 0$.
Systematically shrink the interval from $[a_0, b_0]$ to $[a_1, b_1]$ to $[a_2, b_2]$ and so on.

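7 The Bisection Method
Let $x^{(0)} = (a_0 + b_0)/2$ be the starting value. Then
$$[a_{t+1}, b_{t+1}] = \begin{cases} [a_t, x^{(t)}] & \text{if } g'(a_t)\,g'(x^{(t)}) \le 0, \\ [x^{(t)}, b_t] & \text{if } g'(a_t)\,g'(x^{(t)}) > 0, \end{cases}$$
where $x^{(t+1)} = (a_{t+1} + b_{t+1})/2$.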
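The interval-halving update can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function $g(x) = \log x / (1 + x)$, the bracket $[1, 5]$, and the names `g_prime` and `bisection` are assumptions chosen only to have a concrete score function with no closed-form root.

```python
import math

def g_prime(x):
    # Derivative of the assumed example g(x) = log(x) / (1 + x).
    return (1.0 + 1.0 / x - math.log(x)) / (1.0 + x) ** 2

def bisection(score, a, b, tol=1e-10, max_iter=200):
    """Halve [a, b] until successive midpoints differ by less than tol."""
    if score(a) * score(b) > 0:
        raise ValueError("score(a) and score(b) must not share a sign")
    x = (a + b) / 2.0
    for _ in range(max_iter):
        if score(a) * score(x) <= 0:
            b = x  # a root lies in [a, x]
        else:
            a = x  # a root lies in [x, b]
        x_next = (a + b) / 2.0
        if abs(x_next - x) < tol:  # absolute convergence criterion
            return x_next
        x = x_next
    raise RuntimeError("no convergence within max_iter iterations")

root = bisection(g_prime, 1.0, 5.0)
print(root)
```

The `max_iter` guard plays the role of a stopping rule that flags failure to converge.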

8 Example

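9 Stopping Rule
Stop if the procedure appears to have achieved satisfactory convergence, or if it appears unlikely to do so soon.
The absolute convergence criterion: $|x^{(t+1)} - x^{(t)}| < \epsilon$.
For bisection, $b_t - a_t = 2^{-t}(b_0 - a_0)$.
A true error tolerance of $|x^{(t)} - x^*| < \delta$ is achieved when $2^{-(t+1)}(b_0 - a_0) < \delta$, that is, when $t > \log_2\{(b_0 - a_0)/\delta\} - 1$ holds.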

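10 Relative Convergence Criterion
Criterion: $\frac{|x^{(t+1)} - x^{(t)}|}{|x^{(t)}|} < \epsilon$.
It may not work well when $x^{(t)}$ is close to 0.
Alternatively, use $\frac{|x^{(t+1)} - x^{(t)}|}{|x^{(t)}| + \epsilon} < \epsilon$.
The bisection method is, in theory, guaranteed to converge to a root.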

11 Stopping in Failure to Converge
The MLE may not be unique.
Convergence may not happen due to limited computational precision.
Stopping rules that flag a failure to converge:
Stop after N iterations.
Stop if some convergence measure fails to decrease, or cycles, over several iterations.

12 Bracketing Methods
Bound a root within a sequence of nested intervals of decreasing length.
Bisection is a bracketing method. It is slow!
If $g'$ is continuous on $[a_0, b_0]$ and changes sign over the interval, a root can be found.
Bracketing methods do not rely on the existence, behavior, or ease of deriving $g''$.

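13 Newton-Raphson Iteration
Assumption: $g'$ is continuously differentiable and $g''(x^*) \ne 0$.
At iteration $t$, approximate $g'(x^*)$ by the linear Taylor series expansion
$$0 = g'(x^*) \approx g'(x^{(t)}) + (x^* - x^{(t)})\,g''(x^{(t)}).$$
That is, $x^* \approx x^{(t)} - \frac{g'(x^{(t)})}{g''(x^{(t)})}$.
The updating equation: $x^{(t+1)} = x^{(t)} + h^{(t)}$, where $h^{(t)} = -\frac{g'(x^{(t)})}{g''(x^{(t)})}$.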
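A short sketch of the Newton update $x^{(t+1)} = x^{(t)} - g'(x^{(t)})/g''(x^{(t)})$ follows. The objective $g(x) = \log x/(1+x)$, its hand-derived derivatives, the starting value 3.0, and all function names are assumptions made for illustration only.

```python
import math

def g_prime(x):
    # First derivative of the assumed example g(x) = log(x) / (1 + x).
    return (1.0 + 1.0 / x - math.log(x)) / (1.0 + x) ** 2

def g_double_prime(x):
    # Second derivative of the same assumed example, from the quotient rule.
    numerator = -((1.0 + x) ** 2) / x ** 2 - 2.0 * (1.0 + 1.0 / x - math.log(x))
    return numerator / (1.0 + x) ** 3

def newton(first, second, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x + h with h = -first(x) / second(x)."""
    x = x0
    for _ in range(max_iter):
        h = -first(x) / second(x)
        x = x + h
        if abs(h) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")

newton_root = newton(g_prime, g_double_prime, 3.0)
print(newton_root)
```

A starting value far from the root can make the iteration diverge, which is why the loop is capped and raises instead of returning a bad value.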

14 Example

15 Newton's Method May Not Converge

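16 An Interesting Situation
Suppose $g'''$ is continuous and $g''(x^*) \ne 0$.
Then there exists a neighborhood of $x^*$ within which $g''(x) \ne 0$ for all $x$.
Define $\epsilon^{(t)} = x^{(t)} - x^*$.
Using a Taylor expansion,
$$0 = g'(x^*) = g'(x^{(t)}) + (x^* - x^{(t)})\,g''(x^{(t)}) + \tfrac{1}{2}(x^* - x^{(t)})^2 g'''(q)$$
for some $q$ between $x^{(t)}$ and $x^*$.
That is, $x^{(t)} + h^{(t)} - x^* = (x^* - x^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}$.
That is, $\epsilon^{(t+1)} = (\epsilon^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}$.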

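17 An Interesting Situation (cont'd)
Consider a neighborhood of $x^*$, $N_\delta(x^*) = [x^* - \delta, x^* + \delta]$, for $\delta > 0$.
Define $c(\delta) = \max_{x_1, x_2 \in N_\delta(x^*)} \left| \frac{g'''(x_1)}{2 g''(x_2)} \right|$.
Then $c(\delta) \to \left| \frac{g'''(x^*)}{2 g''(x^*)} \right|$ as $\delta \to 0$, i.e., $\delta c(\delta) \to 0$ as $\delta \to 0$.
Thus, we can choose a $\delta$ such that $\delta c(\delta) < 1$.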

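18 An Interesting Situation (cont'd)
If $x^{(t)} \in N_\delta(x^*)$, then from $\epsilon^{(t+1)} = (\epsilon^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}$ we have
$$|c(\delta)\,\epsilon^{(t+1)}| \le (c(\delta)\,\epsilon^{(t)})^2.$$
Assuming a starting point satisfying $|\epsilon^{(0)}| = |x^{(0)} - x^*| \le \delta$, it follows that $|\epsilon^{(t)}| \le \frac{(c(\delta)\,\delta)^{2^t}}{c(\delta)} \to 0$ as $t \to \infty$.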

19 Newton's Method Converges
If $g'''$ is continuous and $x^*$ is a simple root of $g'$, then there exists a neighborhood of $x^*$ for which Newton's method converges to $x^*$ when started from any $x^{(0)}$ in that neighborhood.
A root $x^*$ of $g'$ is simple if it has multiplicity one, that is, $g'(x^*) = 0$ and $g''(x^*) \ne 0$.
Other conditions under which Newton's method converges also exist.

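20 Convergence Order
Convergence order measures how fast a root-finding method is.
A method has convergence of order $\beta > 0$ if it converges, that is, $\lim_{t \to \infty} \epsilon^{(t)} = 0$, and
$$\lim_{t \to \infty} \frac{|\epsilon^{(t+1)}|}{|\epsilon^{(t)}|^\beta} = c.$$
The convergence speed $c \ne 0$ is a constant.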

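21 Convergence Order of Newton's Method
Recall $\epsilon^{(t+1)} = (\epsilon^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}$.
Then $\frac{\epsilon^{(t+1)}}{(\epsilon^{(t)})^2} = \frac{g'''(q)}{2 g''(x^{(t)})}$.
If Newton's method converges, then $\lim_{t \to \infty} \frac{\epsilon^{(t+1)}}{(\epsilon^{(t)})^2} = \frac{g'''(x^*)}{2 g''(x^*)}$.
Newton's method has quadratic convergence, that is, $\beta = 2$.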

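22 Fisher Scoring
Replace $-l''(\theta^{(t)})$ in the Newton update with $I(\theta^{(t)})$.
$I(\theta)$ is the Fisher information, and can be approximated by $-l''(\theta)$.
$I(\theta^{(t)})$ is the expected Fisher information evaluated at $\theta^{(t)}$.
The updating equation is $\theta^{(t+1)} = \theta^{(t)} + I(\theta^{(t)})^{-1}\, l'(\theta^{(t)})$.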

23 Comparison
Fisher scoring and Newton's method have the same asymptotic properties.
Fisher scoring works better in the beginning to make rapid improvements.
Newton's method works better for refinement near the end.

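24 Secant Method
Newton's method becomes inconvenient if $g''(x^{(t)})$ is hard to calculate.
Use the discrete-difference approximation $g''(x^{(t)}) \approx \frac{g'(x^{(t)}) - g'(x^{(t-1)})}{x^{(t)} - x^{(t-1)}}$.
The updating equation becomes, for $t > 0$,
$$x^{(t+1)} = x^{(t)} - g'(x^{(t)})\,\frac{x^{(t)} - x^{(t-1)}}{g'(x^{(t)}) - g'(x^{(t-1)})}.$$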
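The secant update needs two starting points and no second derivative. The sketch below assumes the same illustrative score function as a stand-in ($g(x) = \log x/(1+x)$ is not taken from the slides), and the names `secant`, `x_prev`, and `x_curr` are hypothetical.

```python
import math

def g_prime(x):
    # Derivative of the assumed example g(x) = log(x) / (1 + x).
    return (1.0 + 1.0 / x - math.log(x)) / (1.0 + x) ** 2

def secant(first, x_prev, x_curr, tol=1e-12, max_iter=100):
    """Newton-like update with g'' replaced by a finite difference."""
    for _ in range(max_iter):
        f_prev, f_curr = first(x_prev), first(x_curr)
        denom = f_curr - f_prev
        if denom == 0.0:
            raise ZeroDivisionError("flat secant line; cannot update")
        x_next = x_curr - f_curr * (x_curr - x_prev) / denom
        if abs(x_next - x_curr) < tol:
            return x_next
        x_prev, x_curr = x_curr, x_next
    raise RuntimeError("secant method did not converge")

secant_root = secant(g_prime, 3.0, 4.0)
print(secant_root)
```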

25 Convergence Order
The secant method has convergence of order $(1 + \sqrt{5})/2 \approx 1.62$.
The secant method has a slower order of convergence than Newton's method.

26 Example

27 THE EM METHOD

28 A Frequentist Setting
Conceive of observed data $x$ generated from random variables $X$, along with some missing or unobserved data from random variables $Z$.
$Z$ is often called latent.
Envision complete data generated from $Y = (X, Z)$.
Given observed data $x$, we wish to maximize a likelihood $L(\theta \mid x)$, which is often hard to work with.
It is easier to deal with the densities of $Y \mid \theta$ and $Z \mid (x, \theta)$.
EM uses these easier densities to maximize $L(\theta \mid x)$.

29 Latent Z
Conceptually, $Z$ may be viewed as having been removed from the complete $Y$ through the application of some many-to-fewer mapping $X = M(Y)$.
Let $f_X(x \mid \theta)$ be the density of the observed data.
Let $f_Y(y \mid \theta)$ be the density of the complete data.
A marginalization model: we observe $X$ having density $f_X(x \mid \theta) = \int_{\{y : M(y) = x\}} f_Y(y \mid \theta)\,dy$.
Conditional density of the missing data: $f_{Z|X}(z \mid x, \theta) = \frac{f_Y(y \mid \theta)}{f_X(x \mid \theta)}$.

30 A Bayesian Setting
Estimate the mode of a posterior distribution $f(\theta \mid x)$.
Consider unobserved random variables $\psi$ in addition to the parameters of interest $\theta$.
Missing data may not be really missing: they are a conceptual tool to simplify problems.

31 Marginalization
$L(\theta \mid x)$ is a marginalization of the complete-data likelihood $L(\theta \mid y) = L(\theta \mid x, z)$.
Alternatively, assume missing parameters $\psi$ of no interest.
There is no difference under the Bayesian paradigm.

32 The EM Algorithm: Objectives
Iteratively seek to maximize $L(\theta \mid x)$ with respect to $\theta$.
Let $\theta^{(t)}$ be the estimated maximizer at iteration $t$, for $t = 0, 1, \ldots$
Let $Q(\theta \mid \theta^{(t)})$ be the expectation of the joint log likelihood for the complete data, conditional on the observed data $X = x$:
$$Q(\theta \mid \theta^{(t)}) = E\{\log L(\theta \mid Y) \mid x, \theta^{(t)}\}.$$

33 The EM Algorithm: Steps
The E step: compute $Q(\theta \mid \theta^{(t)})$.
The M step: maximize $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$ and set $\theta^{(t+1)}$ to the maximizer of $Q$.
Return to the E step until a stopping criterion is met.

34 Fuzzy Clustering
Each point $x_i$ has a probability (weight) $w_{ij}$ of belonging to cluster $C_j$.
Requirements:
For each point $x_i$, $\sum_{j=1}^{k} w_{ij} = 1$.
For each cluster $C_j$, $0 < \sum_{i=1}^{m} w_{ij} < m$.

35 Fuzzy C-Means (FCM)
1. Select an initial fuzzy pseudo-partition, i.e., assign values to all the $w_{ij}$.
2. Repeat:
Compute the centroid of each cluster using the fuzzy pseudo-partition.
Recompute the fuzzy pseudo-partition, i.e., the $w_{ij}$.
3. Until the centroids do not change (or the change is below some threshold).

36 Critical Details
Optimization on the sum of the squared error (SSE):
$$SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^{p}\, dist(x_i, c_j)^2$$
Computing centroids:
$$c_j = \frac{\sum_{i=1}^{m} w_{ij}^{p}\, x_i}{\sum_{i=1}^{m} w_{ij}^{p}}$$
Updating the fuzzy pseudo-partition:
$$w_{ij} = \frac{\left(1 / dist(x_i, c_j)^2\right)^{\frac{1}{p-1}}}{\sum_{q=1}^{k} \left(1 / dist(x_i, c_q)^2\right)^{\frac{1}{p-1}}}$$
When $p = 2$:
$$w_{ij} = \frac{1 / dist(x_i, c_j)^2}{\sum_{q=1}^{k} 1 / dist(x_i, c_q)^2}$$
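The centroid and membership updates can be sketched for one-dimensional data as follows. This is an illustration under stated assumptions: it starts from initial centroids rather than an initial pseudo-partition, clamps squared distances at a tiny positive value to avoid division by zero, and uses made-up data; the name `fuzzy_c_means` is hypothetical.

```python
def fuzzy_c_means(points, centroids, p=2.0, tol=1e-9, max_iter=100):
    """1-D fuzzy c-means; returns (centroids, weights) with weights[i][j] = w_ij."""
    centroids = list(centroids)
    k = len(centroids)
    weights = []
    for _ in range(max_iter):
        # Membership update: w_ij proportional to (1 / dist^2)^(1 / (p - 1)).
        weights = []
        for x in points:
            inv = [(1.0 / max((x - c) ** 2, 1e-12)) ** (1.0 / (p - 1.0))
                   for c in centroids]
            total = sum(inv)
            weights.append([v / total for v in inv])
        # Centroid update: weighted mean with weights w_ij^p.
        new_centroids = []
        for j in range(k):
            num = sum(w[j] ** p * x for w, x in zip(weights, points))
            den = sum(w[j] ** p for w in weights)
            new_centroids.append(num / den)
        shift = max(abs(a - b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, weights

toy_points = [0.0, 0.2, 0.4, 9.6, 9.8, 10.0]
fcm_centroids, fcm_weights = fuzzy_c_means(toy_points, [1.0, 8.0])
print(fcm_centroids)
```

Each row of `fcm_weights` sums to 1, matching the requirement on the $w_{ij}$.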

37 Choice of p
When $p$ is close to 1, FCM behaves like traditional k-means.
When $p$ is larger, the cluster centroids approach the global centroid of all data points.
The partition becomes fuzzier as $p$ increases.

38 Effectiveness

39 Mixture Models
A cluster can be modeled as a probability distribution.
In practice, assume each distribution can be approximated well by a multivariate normal distribution.
Multiple clusters are a mixture of different probability distributions.
A data set is a set of observations from a mixture of models.

40 Object Probability
Suppose there are $k$ clusters and a set $X$ of $m$ objects.
Let the $j$-th cluster have parameter $\theta_j = (\mu_j, \sigma_j)$.
The probability that a point is in the $j$-th cluster is $w_j$, with $w_1 + \cdots + w_k = 1$.
The probability of an object $x$ is
$$prob(x \mid \Theta) = \sum_{j=1}^{k} w_j\, p_j(x \mid \theta_j)$$
$$prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j\, p_j(x_i \mid \theta_j)$$

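41 Example
$$prob(x_i \mid \Theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
With $\theta_1 = (-4, 2)$ and $\theta_2 = (4, 2)$:
$$prob(x \mid \Theta) = w_1 \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x + 4)^2}{8}} + w_2 \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x - 4)^2}{8}}$$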

42 Maximum Likelihood Estimation
Maximum likelihood principle: if we know a set of objects are from one distribution but do not know the parameter, we can choose the parameter maximizing the probability of the data.
Maximize
$$prob(X \mid \Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$$
Equivalently, maximize
$$\log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$$

43 EM Algorithm
Expectation Maximization algorithm:
1. Select an initial set of model parameters.
2. Repeat:
Expectation step: for each object $x_i$, calculate the probability that it belongs to each distribution $j$, i.e., $prob(\text{distribution } j \mid x_i, \Theta)$.
Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood.
3. Until the parameters are stable.
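For a one-dimensional mixture of normals, the two steps have closed forms, sketched below. The synthetic data (two components centered at -4 and 4 with standard deviation 2), the starting values, the fixed iteration count, and the name `em_gmm_1d` are all assumptions for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def em_gmm_1d(data, mus, sigmas, weights, n_iter=100):
    """EM for a 1-D normal mixture; runs a fixed number of iterations."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E step: posterior probability that each object belongs to each component.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M step: weighted maximum likelihood estimates.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-12))
    return mus, sigmas, weights

rng = random.Random(0)
sample = [rng.gauss(-4.0, 2.0) for _ in range(300)] + [rng.gauss(4.0, 2.0) for _ in range(300)]
fit_mus, fit_sigmas, fit_weights = em_gmm_1d(sample, [-1.0, 1.0], [2.0, 2.0], [0.5, 0.5])
print(fit_mus, fit_sigmas, fit_weights)
```

A production version would stop when the parameters are stable instead of running a fixed number of iterations.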

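44 Convergence
The log of the observed-data density can be rewritten as
$$\log f_X(x \mid \theta) = \log f_Y(y \mid \theta) - \log f_{Z|X}(z \mid x, \theta).$$
Thus,
$$E\{\log f_X(x \mid \theta) \mid x, \theta^{(t)}\} = E\{\log f_Y(y \mid \theta) \mid x, \theta^{(t)}\} - E\{\log f_{Z|X}(z \mid x, \theta) \mid x, \theta^{(t)}\},$$
where the expectations are taken with respect to the distribution of $Z \mid (x, \theta^{(t)})$.
Define $H(\theta \mid \theta^{(t)}) = E\{\log f_{Z|X}(Z \mid x, \theta) \mid x, \theta^{(t)}\}$.
Then, $\log f_X(x \mid \theta) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)})$.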

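45 $H(\theta \mid \theta^{(t)})$
$H(\theta \mid \theta^{(t)})$ is maximized by $\theta = \theta^{(t)}$.
For any $\theta \ne \theta^{(t)}$, $H(\theta \mid \theta^{(t)}) \le H(\theta^{(t)} \mid \theta^{(t)})$.
If $\theta^{(t+1)}$ is chosen to maximize $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$, then
$$\log f_X(x \mid \theta^{(t+1)}) - \log f_X(x \mid \theta^{(t)}) \ge 0,$$
since $Q$ increases and $H$ decreases, with strict inequality when $Q(\theta^{(t+1)} \mid \theta^{(t)}) > Q(\theta^{(t)} \mid \theta^{(t)})$.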

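46 Generalized EM Algorithm
Standard EM: choose $\theta^{(t+1)}$ at each iteration to maximize $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$.
Generalized EM (GEM): choose any $\theta^{(t+1)}$ such that $Q(\theta^{(t+1)} \mid \theta^{(t)}) > Q(\theta^{(t)} \mid \theta^{(t)})$.
Both EM and GEM converge under suitable regularity conditions, since neither decreases the observed-data log likelihood from one iteration to the next.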

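47 Order of Convergence
The EM algorithm defines a mapping $\theta^{(t+1)} = \Psi(\theta^{(t)})$.
When EM converges, it converges to a fixed point $\hat{\theta}$ of the mapping.
The global rate of EM convergence is
$$\rho = \lim_{t \to \infty} \frac{\|\theta^{(t+1)} - \hat{\theta}\|}{\|\theta^{(t)} - \hat{\theta}\|}.$$
Let $\Psi'(\theta)$ be the Jacobian matrix whose $(i, j)$-th element is $d\Psi_i(\theta)/d\theta_j$.
Then, $\rho$ equals the largest eigenvalue of $\Psi'(\hat{\theta})$ when $-l''(\hat{\theta} \mid x)$ is positive definite.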

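48 Rationale
Jacobian matrix:
$$J = \begin{pmatrix} \frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial F_m}{\partial x_1} & \cdots & \frac{\partial F_m}{\partial x_n} \end{pmatrix},$$
where $F: R^n \to R^m$.
If the function $F$ is differentiable at a point $x$, then the Jacobian matrix defines a linear map $R^n \to R^m$, which is the best linear approximation of $F$ near the point $x$.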

49 Implications
EM has linear convergence.
EM suffers slower convergence when the proportion of missing information is larger.
Its advantage is ease of implementation.

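50 Hessian Matrix
For a function $f: R^n \to R$ such that all second partial derivatives of $f$ exist and are continuous over the domain of the function, the Hessian matrix (or simply Hessian) is
$$H(f) = \left[ \frac{\partial^2 f}{\partial x_i\, \partial x_j} \right]_{i, j = 1, \ldots, n}.$$
Using Taylor expansion,
$$f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \tfrac{1}{2}\, \Delta x^T H(x)\, \Delta x.$$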

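51 Missing-Information Principle
Recall $\log f_X(x \mid \theta) = Q(\theta \mid \theta^{(t)}) - H(\theta \mid \theta^{(t)})$.
Take second partial derivatives and negate both sides:
$$-l''(\theta \mid x) = -Q''(\theta \mid \omega)\big|_{\omega = \theta} + H''(\theta \mid \omega)\big|_{\omega = \theta}.$$
Rewrite to the missing-information principle:
$$\hat{i}_X(\theta) = \hat{i}_Y(\theta) - \hat{i}_Z(\theta).$$
$\hat{i}_X(\theta) = -l''(\theta \mid x)$: the observed information.
$\hat{i}_Y(\theta)$: the complete information.
$\hat{i}_Z(\theta)$: the missing information.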

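52 Estimating Covariance
The covariance matrix for $\hat{\theta}$ is estimated by the inverse of the observed information, $\hat{i}_X(\hat{\theta})^{-1} = [\hat{i}_Y(\hat{\theta}) - \hat{i}_Z(\hat{\theta})]^{-1}$, with
$$\hat{i}_Z(\theta) = var\left\{ \frac{d \log f_{Z|X}(Z \mid x, \theta)}{d\theta} \right\},$$
where the variance is taken with respect to $f_{Z|X}$.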

53 MONTE CARLO METHODS

54 Estimation of a Definite Integral
Motivating example: estimate $\int_D f(x)\,dx$.
Situations where Monte Carlo is not needed:
The integral can be evaluated in closed form.
$D$ is of low dimensionality (e.g., 1-2).
Decomposition: suppose $f(x)$ can be rewritten as $f(x) = g(x)\,p(x)$ such that $p(x)$ can be treated as a probability density function, that is, $p(x) \ge 0$ and $\int_D p(x)\,dx = 1$.

55 Estimation
How to estimate $E[g(X)] = \int_D g(x)\,p(x)\,dx$?
Draw a random sample $\{x_1, x_2, \ldots, x_m\}$ from the distribution with probability density $p$, and use the estimate $\frac{1}{m} \sum_{i=1}^{m} g(x_i)$.

56 Monte Carlo Method (PDF decomposition)
Decompose a function of interest to include a probability density function as a factor.
Identify an expected value.
Use a sample to estimate the expected value.
The PDF decomposition may not be unique.
Typically used when dimensionality is high.

57 Example: Uniform Distribution
In the $[0, 1] \times [0, 1]$ square, what is the average distance between two points?
In general, we can answer this question for any finite region using any distance measure.
Compute $\int\!\!\int dist(x, y)\,dx\,dy$, where $x$ and $y$ range over the points of the square.
Draw a random sample of point pairs $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$; an estimate is $\frac{1}{m} \sum_{i=1}^{m} dist(x_i, y_i)$.
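A sketch of this estimate for the Euclidean distance is given below. The sample size, the seed, and the name `mean_distance_unit_square` are arbitrary choices for illustration; each draw produces one pair of independent uniform points.

```python
import math
import random

def mean_distance_unit_square(m, rng):
    """Monte Carlo estimate of the mean Euclidean distance between two uniform points."""
    total = 0.0
    for _ in range(m):
        x1, y1 = rng.random(), rng.random()  # first point of the pair
        x2, y2 = rng.random(), rng.random()  # second point of the pair
        total += math.hypot(x1 - x2, y1 - y2)
    return total / m

distance_estimate = mean_distance_unit_square(100_000, random.Random(1))
print(distance_estimate)
```

The exact value of this integral is known to be about 0.5214, which gives a check on the simulation.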

58 Example: Non-Uniform Distribution
In an airport, what is the average distance between the end of a runway and where a plane lands?
The landing points are not uniformly distributed.
Let $L$ be the length of the runway, and $p(x)$ be the probability density that a plane lands at the point of distance $x$ from the end of the runway.
Compute $\int_0^L x\,p(x)\,dx$.
Using a sample $\{x_1, x_2, \ldots, x_m\}$ of landing points following the distribution $p(x)$, an estimate is $\frac{1}{m} \sum_{i=1}^{m} x_i$.

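59 Estimation of Variance
Assumption: the random sample values $\{g(x_i)\}$ are independent and have 0 correlations.
Then the variance of the estimate $\hat{\mu} = \frac{1}{m} \sum_{i=1}^{m} g(x_i)$ is $var(g(X))/m$, which can be estimated by $\frac{1}{m(m-1)} \sum_{i=1}^{m} (g(x_i) - \hat{\mu})^2$.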

60 Simulation
Estimate $E_p[g(X)]$ using MC.
Key: simulation of $p$; $p$ is called the target distribution.
If $p$ is a standard parametric distribution, software tools are available to simulate it.

61 Inverse CDF
For any continuous distribution function $f$, if $U \sim Unif(0, 1)$, then $X = f^{-1}(U) = \inf\{x : f(x) \ge U\}$ has a cumulative distribution function equal to $f$.
If $f^{-1}$ is available, we can simulate the target distribution accordingly.
This is known as the inverse cumulative distribution function or probability integral transform approach.
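As a concrete instance, the exponential distribution has distribution function $1 - e^{-\lambda x}$, whose inverse is $-\log(1 - u)/\lambda$. The sketch below applies the transform; the rate 2.0, the seed, and the name `sample_exponential` are assumptions for illustration.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw from Exp(rate) by inverting F(x) = 1 - exp(-rate * x)."""
    u = rng.random()  # U ~ Unif(0, 1), in [0, 1)
    return -math.log(1.0 - u) / rate

rng = random.Random(7)
exp_draws = [sample_exponential(2.0, rng) for _ in range(100_000)]
exp_mean = sum(exp_draws) / len(exp_draws)
print(exp_mean)  # should be close to 1 / rate
```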

62 Linear Interpolation
Use when $f^{-1}$ is not available but $f$ is either available or easily approximated.
Use a grid of $x_1, x_2, \ldots, x_m$ spanning the region of support of the target distribution.
Calculate or approximate $u_i = f(x_i)$.
Draw $U \sim Unif(0, 1)$ and linearly interpolate between the two nearest grid points for which $u_i \le U \le u_j$ according to
$$X = \frac{u_j - U}{u_j - u_i}\, x_i + \frac{U - u_i}{u_j - u_i}\, x_j.$$

63 Linear Interpolation: Pros and Cons
The degree of approximation is deterministic.
The approximation error can be reduced to any desired level by increasing $m$ sufficiently.
It requires a complete approximation to $f$ regardless of the desired sample size.
It does not generalize to multiple dimensions.
It is less efficient than other approaches.

64 Rejection Sampling
Assumptions:
$f(x)$ can be calculated or approximated.
We know how to sample from a density $g$ and can calculate $g(x)$.
We have an envelope $e(\cdot)$ such that $e(x) = g(x)/\alpha \ge f(x)$ for all $x$ with $f(x) > 0$, where $\alpha \le 1$ is a constant.
Algorithm:
1. Sample $Y \sim g$.
2. Sample $U \sim Unif(0, 1)$.
3. Reject $Y$ if $U > f(Y)/e(Y)$, and return to step 1.
4. Otherwise, keep the value of $Y$ as a unit of the random sample from $f$.
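The four steps can be sketched for a simple target. The choices here are assumptions for illustration: the target is the Beta(2, 2) density $6x(1-x)$ on $[0, 1]$, $g$ is Unif(0, 1), and the envelope is the constant 1.5 (so $\alpha = 2/3$), which equals the maximum of the target.

```python
import random

ENVELOPE_HEIGHT = 1.5  # e(x) = g(x) / alpha with g(x) = 1 and alpha = 2/3

def target_density(x):
    # Assumed target: Beta(2, 2) density, maximized at x = 0.5 with value 1.5.
    return 6.0 * x * (1.0 - x)

def rejection_sample(n, rng):
    """Return n accepted draws and the total number of candidates tried."""
    kept, tries = [], 0
    while len(kept) < n:
        tries += 1
        y = rng.random()  # step 1: Y ~ g
        u = rng.random()  # step 2: U ~ Unif(0, 1)
        if u <= target_density(y) / ENVELOPE_HEIGHT:  # steps 3-4
            kept.append(y)
    return kept, tries

accepted, n_tries = rejection_sample(20_000, random.Random(11))
acceptance_rate = len(accepted) / n_tries
accepted_mean = sum(accepted) / len(accepted)
print(acceptance_rate, accepted_mean)
```

The observed acceptance rate should be near $\alpha$, which illustrates why a tight envelope matters for efficiency.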

65 Illustration

66 Why Does Rejection Sampling Work?

67 Efficiency of Rejection Sampling
Rejection sampling can be regarded as uniform sampling from the 2-d region under the curve $e$, followed by discarding the draws that fall between $e$ and $f$.
The larger the area between $e$ and $f$, the lower the efficiency: many draws may have to be discarded.

68 Squeezed Rejection Sampling
For cases where evaluating $f$ is costly.
Use a non-negative squeezing function $s$ such that $s(x) \le f(x)$ for any point with $f(x) > 0$.
Algorithm:
1. Sample $Y \sim g$.
2. Sample $U \sim Unif(0, 1)$.
3. If $U \le s(Y)/e(Y)$, keep the value of $Y$ as a unit of the sample.
4. Otherwise, check whether $U \le f(Y)/e(Y)$. If so, keep the value as a unit of the sample; if not, reject $Y$.

69 Illustration: Saving

70 Envelope & Squeezing Functions?
Challenge: how to develop envelope and squeezing functions automatically?
Assumptions:
Let $l(x) = \log f(x)$.
$f(x) > 0$ on a (possibly infinite) interval.
$f$ is log-concave, that is, $l(a) - 2l(b) + l(c) < 0$ for any $a < b < c$ in the support region of $f$ with $b = (a + c)/2$.
$f$ is continuous and differentiable.
$l'(x)$ exists and decreases monotonically with respect to $x$ (but may have discontinuities).

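71 Adaptive Rejection Sampling
Evaluate $l$ and $l'$ at $k$ points $x_1 < x_2 < \cdots < x_k$.
The tangents at $x_i$ and $x_{i+1}$ intersect at
$$z_i = \frac{l(x_{i+1}) - l(x_i) - x_{i+1}\, l'(x_{i+1}) + x_i\, l'(x_i)}{l'(x_i) - l'(x_{i+1})}$$
for $i = 1, \ldots, k - 1$.
For $x \in [z_{i-1}, z_i]$, let $e_k^*(x) = l(x_i) + (x - x_i)\, l'(x_i)$; then $e_k(x) = \exp\{e_k^*(x)\}$ defines an envelope.
Similarly, for $x \in [x_i, x_{i+1}]$, let $s_k^*(x) = \frac{(x_{i+1} - x)\, l(x_i) + (x - x_i)\, l(x_{i+1})}{x_{i+1} - x_i}$; then $s_k(x) = \exp\{s_k^*(x)\}$ defines a squeezing function.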

72 Illustration

73 Illustration

74 An Alternative Not Using $l'$
Define $L_i(\cdot)$ to be the straight line function connecting $(x_i, l(x_i))$ and $(x_{i+1}, l(x_{i+1}))$ for $i = 1, \ldots, k - 1$.
For $x \in [x_i, x_{i+1}]$, $e_k^*(x) = \min\{L_{i-1}(x), L_{i+1}(x)\}$, and $e_k(x) = \exp\{e_k^*(x)\}$ defines an envelope.

75 Illustration

76 Approximate Simulation
Suppose we have an envelope $g$ for the target density $f$.
Let $X = \{x_1, x_2, \ldots, x_m\}$ be a set of values drawn iid from $g$.
Define the standardized importance weights as
$$w(x_i) = \frac{f(x_i)/g(x_i)}{\sum_{j=1}^{m} f(x_j)/g(x_j)}.$$

77 Sampling Importance Resampling
1. Sample candidates $Y_1, Y_2, \ldots, Y_m$ iid from $g$.
2. Calculate the standardized importance weights $w(Y_1), w(Y_2), \ldots, w(Y_m)$.
3. Resample $X_1, X_2, \ldots, X_m$ from $Y_1, Y_2, \ldots, Y_m$ with replacement, with probabilities $w(Y_1), w(Y_2), \ldots, w(Y_m)$.
A random variable $X$ drawn with the sampling importance resampling algorithm has a distribution that converges to $f$ as $m \to \infty$.

78 Comparison
Rejection sampling is perfect: the distribution of a generated draw is exactly $f$, but it requires a random number of draws to obtain a sample of size $n$.
Sampling importance resampling uses a predefined number of draws to generate a sample of size $n$, but permits a random degree of approximation to $f$ in the distribution of the sampled points.

79 MARKOV CHAIN MONTE CARLO

80 Joint Distribution of a Sequence
Consider a sequence of random variables $\{X^{(t)}\}$, $t = 0, 1, \ldots$
Each $X^{(t)}$ may equal one of a finite or countably infinite number of possible values, called states.
The state space is the set of possible values of the random variable $X^{(t)}$.
A complete probabilistic specification of $X^{(0)}, X^{(1)}, \ldots, X^{(n)}$, i.e., the joint distribution, is
$$P[X^{(0)} = x^{(0)}, \ldots, X^{(n)} = x^{(n)}] = P[X^{(n)} = x^{(n)} \mid x^{(0)}, \ldots, x^{(n-1)}] \times \cdots \times P[X^{(1)} = x^{(1)} \mid x^{(0)}]\; P[X^{(0)} = x^{(0)}].$$

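81 Markov Property
The conditional independence assumption:
$$P[X^{(t)} = x^{(t)} \mid x^{(0)}, \ldots, x^{(t-1)}] = P[X^{(t)} = x^{(t)} \mid x^{(t-1)}].$$
Then, the joint distribution can be simplified to
$$P[X^{(0)} = x^{(0)}, \ldots, X^{(n)} = x^{(n)}] = P[X^{(n)} = x^{(n)} \mid x^{(n-1)}] \times \cdots \times P[X^{(1)} = x^{(1)} \mid x^{(0)}]\; P[X^{(0)} = x^{(0)}].$$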

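82 Markov Chains
Let $p_{ij}^{(t)}$ be the probability that the observed state changes from state $i$ at time $t$ to state $j$ at time $t + 1$.
The sequence $\{X^{(t)}\}$, $t = 0, 1, \ldots$, is a Markov chain if
$$p_{ij}^{(t)} = P[X^{(t+1)} = j \mid X^{(0)} = x^{(0)}, \ldots, X^{(t)} = i] = P[X^{(t+1)} = j \mid X^{(t)} = i].$$
A Markov chain is time-homogeneous if $p_{ij}^{(t)} = p_{ij}$ for all $t$, and time-inhomogeneous otherwise.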

83 Transition Probability Matrix
For a time-homogeneous Markov chain whose state space has $s$ states, the $s \times s$ transition probability matrix $P = [p_{ij}]$ describes the one-step transition probabilities.
$0 \le p_{ij} \le 1$, and each row of $P$ sums to 1.

84 Example

85 Recurrent and Nonnull States
A state is recurrent if the chain returns to the state with probability 1; equivalently, the chain visits the state infinitely often: $P\left(\sum_{t} 1\{X^{(t)} = i\} = \infty\right) = 1$.
Otherwise, the state is transient, that is, there is a non-zero probability of leaving the state and never coming back.
A state is nonnull (aka positive recurrent) if the expected time until recurrence is finite.
If the state space is finite, every recurrent state is nonnull.

86 Ergodic Markov Chains
A Markov chain is irreducible if for any $i$ and $j$, state $j$ can be reached from state $i$ in a finite number of steps: there exists $m > 0$ such that $P[X^{(m+n)} = j \mid X^{(n)} = i] > 0$.
A state $j$ has period $d > 0$ if the probability of going from state $j$ back to state $j$ in $n$ steps is 0 for all $n$ not divisible by $d$, and $d$ is the largest such integer.
A Markov chain is aperiodic if every state has period 1; otherwise it is periodic.
A Markov chain is ergodic if it is irreducible, aperiodic, and all states are nonnull and recurrent.

87 Stationary Distribution
Let $\pi$ be a vector of probabilities such that $\sum_i \pi_i = 1$, where $\pi_i$ denotes the marginal probability that $X^{(t)} = i$.
The marginal distribution of $X^{(t+1)}$ is then $\pi^T P$.
A distribution $\pi$ such that $\pi^T P = \pi^T$ is called a stationary distribution for the Markov chain having transition probability matrix $P$.
If $X^{(t)}$ follows a stationary distribution, then the marginal distributions of $X^{(t)}$ and $X^{(t+1)}$ are identical.
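For a small chain, a stationary distribution can be approximated by repeatedly applying $\pi^T \leftarrow \pi^T P$. The sketch below assumes an irreducible, aperiodic chain; the two-state matrix `P` is made up for illustration, and its stationary distribution $(5/6, 1/6)$ follows from solving $\pi^T P = \pi^T$ by hand.

```python
def stationary_distribution(P, n_iter=500):
    """Approximate pi with pi^T P = pi^T by repeated multiplication."""
    s = len(P)
    pi = [1.0 / s] * s  # start from the uniform distribution
    for _ in range(n_iter):
        pi = [sum(pi[i] * P[i][j] for i in range(s)) for j in range(s)]
    return pi

# Hypothetical two-state transition matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi_hat = stationary_distribution(P)
print(pi_hat)
```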

88 Reversible Markov Chains
A time-homogeneous Markov chain is reversible if for any $i$ and $j$ in the state space, $\pi_i\, p_{ij} = \pi_j\, p_{ji}$ (aka the detailed balance condition).
In that case, $\pi$ is a stationary distribution for the chain.
The joint distribution of a sequence of observations is the same whether the chain is run forwards or backwards.

89 Uniqueness of Stationary Distribution
If a Markov chain with transition probability matrix $P$ and stationary distribution $\pi$ is irreducible and aperiodic, then $\pi$ is unique and the $\pi_j$ are the solutions to the following set of equations:
$$\pi_j \ge 0, \qquad \sum_i \pi_i = 1, \qquad \pi_j = \sum_i \pi_i\, p_{ij} \text{ for each } j.$$

90 Ergodic Theorem
If $X^{(1)}, X^{(2)}, \ldots$ are realizations from an irreducible and aperiodic Markov chain with stationary distribution $\pi$, then $X^{(n)}$ converges in distribution to the distribution given by $\pi$, and for any function $h$,
$$\frac{1}{n} \sum_{t=1}^{n} h(X^{(t)}) \to E_\pi\{h(X)\} \text{ almost surely as } n \to \infty,$$
provided $E_\pi\{|h(X)|\}$ exists.

91 Bayesian Inference
Suppose $X$ has a distribution parameterized by $\theta$.
Prior distribution $f(\theta)$: the density assigned to $\theta$ before observing the data.
Bayes' theorem: $f(\theta \mid x) = c\, f(\theta)\, L(\theta \mid x)$.
$f(\theta \mid x)$: the posterior density of $\theta$, used for statistical inference about $\theta$.
$c = 1 / \int f(\theta)\, L(\theta \mid x)\, d\theta$ is often difficult to compute directly.

92 Bayes Factor
Let $\tilde{\theta}$ be the posterior mode, and $\theta^*$ be the true value of $\theta$.
The posterior distribution of $\tilde{\theta}$ converges to $N(\theta^*, I(\theta^*)^{-1})$ as $n \to \infty$, under regularity conditions.
The observed data should overwhelm any prior as $n \to \infty$.
For two competing hypotheses $H_1$ and $H_2$, the Bayes factor is
$$B_{2,1} = \frac{P[x \mid H_2]}{P[x \mid H_1]}.$$

93 Markov Chain Monte Carlo
Idea: construct an irreducible, aperiodic Markov chain whose stationary distribution equals the target distribution $f$ in Monte Carlo.
Challenge: the distribution of $X^{(t)}$ may differ substantially from $f$ when $t$ is too small, and the $X^{(t)}$ are serially dependent.

94 Metropolis-Hastings Algorithm
At $t = 0$, select $X^{(0)} = x^{(0)}$ drawn at random from some starting distribution $g$, with the requirement $f(x^{(0)}) > 0$.
Given $X^{(t)} = x^{(t)}$, generate $X^{(t+1)}$ as follows:
1. Sample a candidate value $X^*$ from a proposal distribution $g(\cdot \mid x^{(t)})$.
2. Compute the Metropolis-Hastings ratio $R(x^{(t)}, X^*) = \frac{f(X^*)\, g(x^{(t)} \mid X^*)}{f(x^{(t)})\, g(X^* \mid x^{(t)})}$.
3. Sample a value for $X^{(t+1)}$ according to the following: $X^{(t+1)} = X^*$ with probability $\min\{R(x^{(t)}, X^*), 1\}$, and $X^{(t+1)} = x^{(t)}$ otherwise.
4. Increment $t$ and return to step 1.
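A compact sketch of these four steps, working with log densities to avoid underflow, is shown below. The target (a standard normal known only up to a constant), the normal random-walk proposal, the burn-in length, and all function names are assumptions for illustration; with this symmetric proposal the $g$ terms cancel, but they are kept so the general ratio is visible.

```python
import math
import random

def metropolis_hastings(log_f, propose, log_g, x0, n_steps, rng):
    """log_g(a, b) is the log density of proposing a when the chain is at b."""
    x = x0
    chain = []
    for _ in range(n_steps):
        candidate = propose(x, rng)                       # step 1
        log_ratio = (log_f(candidate) + log_g(x, candidate)) \
            - (log_f(x) + log_g(candidate, x))            # step 2, on the log scale
        if rng.random() < math.exp(min(0.0, log_ratio)):  # step 3
            x = candidate
        chain.append(x)                                   # step 4
    return chain

def log_target(x):
    return -0.5 * x * x  # unnormalized log density of N(0, 1)

def propose(x, rng):
    return x + rng.gauss(0.0, 1.0)

def log_proposal(a, b):
    return -0.5 * (a - b) ** 2  # normal proposal, constants dropped

mh_chain = metropolis_hastings(log_target, propose, log_proposal, 0.0, 60_000, random.Random(42))
mh_kept = mh_chain[5_000:]  # discard an assumed burn-in period
mh_mean = sum(mh_kept) / len(mh_kept)
mh_var = sum((v - mh_mean) ** 2 for v in mh_kept) / len(mh_kept)
print(mh_mean, mh_var)
```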

95 Why Does Metropolis-Hastings Work?
Markov: $X^{(t+1)}$ is only dependent on $X^{(t)}$.
A user has to check whether the chain is irreducible and aperiodic. If so, the chain generated has a unique limiting stationary distribution.
Intuition:
We can compute a function proportional to the target distribution, but not the exact distribution.
As more and more sample values are produced, the distribution of values more closely approximates the desired distribution.

96 Burn-in Period
Sometimes the chain depends on the starting value persistently.
Burn-in period: omit some of the initial realizations of the chain when computing a sample average.
Challenge: how to design proposal distributions?

97 Independence Chains
Set $g(x^* \mid x^{(t)}) = g(x^*)$ for some fixed density $g$.
Each candidate value is drawn independently of the past.
The Metropolis-Hastings ratio is $R(x^{(t)}, X^*) = \frac{f(X^*)\, g(x^{(t)})}{f(x^{(t)})\, g(X^*)}$.
The resulting Markov chain is irreducible and aperiodic if $g(x) > 0$ whenever $f(x) > 0$.

98 Example
Consider a set of observed data points sampled iid from a two-component mixture distribution with mixing proportion $\delta$.
Find the value of $\delta$.

99 Setting Prior and Proposal Distribution
Assume the prior distribution for $\delta$ is Unif(0, 1).
Proposal distributions:
Chain 1 uses Beta(1, 1), which is equivalent to Unif(0, 1).
Chain 2 uses Beta(2, 10), which is skewed right with mean $2/12 \approx 0.167$. The values around 0.7 are unlikely to be generated.

100 Sample Paths

101 Histograms of $\delta^{(t)}$

102 Random Walk Chains
Draw $\epsilon \sim h(\epsilon)$ for some density $h$.
Common choices for $h$ include a uniform distribution over a ball centered at the origin, a scaled standard normal distribution, and a scaled Student's t distribution.
Set $X^* = x^{(t)} + \epsilon$, so that $g(x^* \mid x^{(t)}) = h(x^* - x^{(t)})$.
If the support region of $f$ is connected and $h$ is positive in a neighborhood of 0, the resulting chain is irreducible and aperiodic.

103 Why Do Random Walk Chains Work?

104 Gibbs Sampling
Idea: sampling in a high-dimensional space is sometimes difficult. Gibbs sampling sequentially samples from univariate conditional distributions.
Let $X = (X_1, \ldots, X_p)^T$ and $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)^T$.
Suppose that the univariate conditional density of $X_i \mid X_{-i} = x_{-i}$, denoted by $f(x_i \mid x_{-i})$, is easily sampled for $i = 1, \ldots, p$.

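105 Gibbs Sampling Algorithm
1. Select starting values $x^{(0)}$, and set $t = 0$.
2. Generate, in turn,
$X_1^{(t+1)} \sim f(x_1 \mid x_2^{(t)}, \ldots, x_p^{(t)})$,
$X_2^{(t+1)} \sim f(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_p^{(t)})$,
$\ldots$
$X_p^{(t+1)} \sim f(x_p \mid x_1^{(t+1)}, \ldots, x_{p-1}^{(t+1)})$.
3. Increment $t$ and go to step 2.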
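A standard textbook case where both univariate conditionals are easy to sample is a bivariate normal with zero means, unit variances, and correlation $\rho$: each conditional is $N(\rho\, x_{\text{other}}, 1 - \rho^2)$. The sketch below assumes that target; the value $\rho = 0.8$, the burn-in length, and the function name are illustrative choices.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps, rng, x1=0.0, x2=0.0):
    """Alternate draws from X1 | X2 and X2 | X1 for a standard bivariate normal."""
    cond_sd = math.sqrt(1.0 - rho * rho)
    draws = []
    for _ in range(n_steps):
        x1 = rng.gauss(rho * x2, cond_sd)  # uses the current value of x2
        x2 = rng.gauss(rho * x1, cond_sd)  # uses the freshly updated x1
        draws.append((x1, x2))
    return draws

gibbs_draws = gibbs_bivariate_normal(0.8, 20_000, random.Random(5))[1_000:]
n_kept = len(gibbs_draws)
m1 = sum(a for a, _ in gibbs_draws) / n_kept
m2 = sum(b for _, b in gibbs_draws) / n_kept
cov = sum((a - m1) * (b - m2) for a, b in gibbs_draws) / n_kept
v1 = sum((a - m1) ** 2 for a, _ in gibbs_draws) / n_kept
v2 = sum((b - m2) ** 2 for _, b in gibbs_draws) / n_kept
gibbs_corr = cov / math.sqrt(v1 * v2)
print(gibbs_corr)
```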

106 BOOTSTRAPPING

107 Feature Estimation
Consider a cumulative distribution function $F$.
We are interested in a feature of $F$ expressed as a function $\theta = T(F)$ of $F$.
Example: $T(F) = \int z\, dF(z)$ is the mean of $F$.
Suppose we have random variables $X_1, \ldots, X_n \sim$ i.i.d. $F$ (denoted as $X \sim F$), and $x_1, \ldots, x_n$ are realizations of $X_1, \ldots, X_n$.
If $\hat{F}$ is the empirical distribution function of the observed data, then $\hat{\theta} = T(\hat{F})$ is an estimate of $\theta$.

108 Unknown Distribution Function
When $F$ is unknown, we are still interested in $T(F)$ or $R(X, F)$, where $X$ is the set of observed data.
Example: $R(X, F) = \frac{T(\hat{F}) - T(F)}{S(\hat{F})}$, where $S(\hat{F})$ is the estimated standard deviation of $T(\hat{F})$.
Bootstrap: approximate the distribution of $R(X, F)$ using the observed data $\hat{F}$ (an estimate of $F$).

109 Bootstrap Sample of Pseudo-data
A bootstrap sample $X^* = \{X_1^*, \ldots, X_n^*\}$ is also called a pseudo-dataset.
The $X_i^*$ are i.i.d. random variables with distribution $\hat{F}$.

110 A Simple Example
Let $\{x_1, x_2, x_3\} = \{1, 2, 6\}$ be an i.i.d. sample from a distribution $F$ that has mean $\theta$.
We want to bootstrap the sample mean $\hat{\theta} = T(\hat{F})$, the estimator of $\theta$.
Let $X^*$ consist of 3 elements drawn i.i.d. from $\hat{F}$: there are $3^3 = 27$ possible outcomes for $X^*$.
Let $\hat{F}^*$ be the empirical distribution function of such a sample and $\hat{\theta}^* = T(\hat{F}^*)$.
There are only 10 possible outcomes of $\hat{\theta}^*$.

111 Possible Outcomes of $\hat{\theta}^*$

112 The Bootstrap Principle
Approximate the distribution of $R(X, F)$ using the distribution of $R(X^*, \hat{F})$.
Example: a 25/27 (~93%) confidence interval for $\theta$ is (4/3, 14/3), using quantiles of the distribution of $\hat{\theta}^*$.
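Because the dataset {1, 2, 6} is tiny, the bootstrap distribution of the sample mean can be enumerated exactly instead of simulated. The sketch below lists all 27 ordered pseudo-datasets with exact fractions and checks the mass between 4/3 and 14/3 (endpoints included); the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

data = (1, 2, 6)
n = len(data)

# Every ordered pseudo-dataset of size n drawn with replacement is equally likely.
mean_counts = Counter(Fraction(sum(s), n) for s in product(data, repeat=n))
n_outcomes = sum(mean_counts.values())
boot_dist = {m: Fraction(c, n_outcomes) for m, c in sorted(mean_counts.items())}

coverage = sum(p for m, p in boot_dist.items()
               if Fraction(4, 3) <= m <= Fraction(14, 3))

for m, p in boot_dist.items():
    print(m, p)
print("coverage:", coverage)
```

Only the two extreme pseudo-datasets, {1, 1, 1} and {6, 6, 6}, fall outside the interval.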

113 Non-parametric Bootstrap
If the sample size is big, the number of potential bootstrap pseudo-datasets is very large, and it is impossible to enumerate all of them.
Draw $B$ independent random bootstrap pseudo-datasets $X_i^*$ for $i = 1, \ldots, B$.
Approximate the distribution of $R(X, F)$ using the empirical distribution of $R(X_i^*, \hat{F})$ for $i = 1, \ldots, B$.
The simulation error can be made arbitrarily small by increasing $B$.

114 A Simple Example
Draw with replacement from {1, 2, 6} with equal probability.
Each bootstrap pseudo-dataset produces a corresponding value $\hat{\theta}_i^*$ of the sample mean.

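115 Bootstrap Bias Correction
If we set $R(X, F) = T(\hat{F}) - T(F)$, it is the bias of $\hat{\theta} = T(\hat{F})$.
The mean bias is estimated by bootstrap using $\bar{\theta}^* - \hat{\theta}$, where $\bar{\theta}^* = \frac{1}{B} \sum_{i=1}^{B} \hat{\theta}_i^*$.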

116 Ensemble Classifiers
$C^*(x) = Vote(C_1(x), \ldots, C_k(x))$
Figure from [Tan, Steinbach, Kumar]

117 Why May Ensemble Method Work?
Suppose there are two classes and each base classifier has an error rate of 35%.
What if we use 25 base classifiers?
If all base classifiers are identical, the ensemble error rate is still 35%.
If base classifiers are independent, the ensemble makes a wrong prediction only if more than half of the base classifiers are wrong:
$$\sum_{i=13}^{25} \binom{25}{i}\, 0.35^{i}\, 0.65^{25-i} \approx 0.06$$
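The binomial tail sum can be checked directly. The sketch below assumes independent base classifiers with a common error rate and an odd number of voters; the name `ensemble_error` is hypothetical.

```python
from math import comb

def ensemble_error(n_classifiers, base_error):
    """P(more than half of n independent classifiers are wrong)."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, i) * base_error ** i * (1.0 - base_error) ** (n_classifiers - i)
        for i in range(majority, n_classifiers + 1)
    )

error_25 = ensemble_error(25, 0.35)
print(round(error_25, 3))
```

Calling `ensemble_error(25, 0.5)` gives 0.5, which illustrates why each base classifier must do better than random guessing.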

118 Ensemble Error Rate
Figure from [Tan, Steinbach, Kumar]

119 Ensemble Classifiers: When?
The base classifiers should be independent of each other.
Each base classifier should do better than a classifier that performs random guessing.

120 How to Construct Ensemble?
Manipulating the training set: derive multiple training sets and build a base classifier on each.
Manipulating the input features: use only a subset of features in a base classifier.
Manipulating the class labels: if there are many classes, for each base classifier randomly divide the classes into two subsets A and B; for a test case, if a base classifier predicts its class as A, all classes in A receive a vote.
Manipulating the learning algorithm, e.g., using different network configurations in ANN.

121 Bootstrap
Given an original training set $T$, derive a training set $T'$ by repeatedly sampling uniformly with replacement.
If $T$ has $n$ tuples, each tuple has a probability $p = 1 - (1 - 1/n)^n$ of being selected in $T'$.
When $n \to \infty$, $p \to 1 - 1/e \approx 0.632$.
Use the tuples not in $T'$ as the test set.
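The inclusion probability and its limit are easy to tabulate; this short sketch only evaluates the formula $1 - (1 - 1/n)^n$ for a few values of $n$ (the function name is illustrative).

```python
import math

def inclusion_probability(n):
    """Probability that a given tuple appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

limit = 1.0 - math.exp(-1.0)
for n in (10, 100, 1000, 10 ** 6):
    print(n, inclusion_probability(n), limit)
```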

122 Bagging
Run bootstrap $k$ times to obtain $k$ base classifiers.
A test instance is assigned to the class that receives the highest number of votes.
Strength: bagging reduces the variance of base classifiers, which is good for unstable base classifiers.
Unstable classifiers are sensitive to minor perturbations in the training set, e.g., decision trees, associative classifiers, and ANN.
For stable classifiers (e.g., linear discriminant analysis and k-NN classifiers), bagging may even degrade the performance since the training sets are smaller.
Bagging overfits less on noisy data.

123 Bootstrap in Testing
Use a bootstrap sample as the training set, and use the tuples not in the training set as the test set.
.632 bootstrap: compute the overall accuracy by combining the accuracies $\epsilon_i$ of each bootstrap sample with the accuracy $acc_{all}$ computed from a classifier using the whole data set as the training set:
$$acc_{.632\,bootstrap} = \frac{1}{k} \sum_{i=1}^{k} \left(0.632\, \epsilon_i + 0.368\, acc_{all}\right)$$

124 NONPARAMETRIC DENSITY ESTIMATION

125 Density Estimation
Using observations of random variables $X_1, X_2, \ldots, X_n$ sampled independently from a density function $f$, estimate $f$.
Why density estimation?
Assess multimodality, skew, tail behavior, and so on.
Support decision making and classification, and summarize Bayesian posteriors.
A useful presentation tool to summarize a distribution.
A tool in some other methods, such as MCMC.

126 General Idea
Assume a parametric model $X_1, X_2, \ldots, X_n \sim$ iid $f_{X|\theta}$, where $\theta$ is a very low-dimensional parameter vector.
Estimate the parameter as $\hat{\theta}$ using some estimation paradigm, such as maximum likelihood, Bayesian, or method-of-moments estimation.
The resulting density estimate at $x$ is $f_{X|\theta}(x \mid \hat{\theta})$.
If the assumed model $f_{X|\theta}$ is incorrect, the inferential error can be large.

127 Moments
For a set of points representing a probability density:
0th moment: total probability, i.e., 1.
1st moment: mean.
2nd central moment: variance.
3rd standardized moment: skewness.
Normalized moment: $\frac{E[(X - \mu)^n]}{\sigma^n}$.

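128 Method of Moments
Idea: use relations between population moments and the parameters of interest. To estimate k unknown parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ of a distribution $f_X(x \mid \theta)$, suppose the first k moments of the true distribution can be expressed as functions of θ:
$$\mu_i = E[X^i] = g_i(\theta_1, \ldots, \theta_k), \quad i = 1, \ldots, k$$
Draw a sample of n units $x_1, x_2, \ldots, x_n$; then the i-th sample moment is
$$\hat\mu_i = \frac{1}{n} \sum_{j=1}^{n} x_j^i$$
which is an unbiased estimate of the population moment $\mu_i$. Solve the equations $\hat\mu_i = g_i(\hat\theta_1, \ldots, \hat\theta_k)$, $i = 1, \ldots, k$, for $\hat\theta$.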
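
As a worked illustration, the sketch below applies the method of moments to a gamma distribution. The choice of the gamma family and its shape/scale parameterization (mean = shape × scale, variance = shape × scale²) is an assumption for the example, not something taken from the slides; the function names are hypothetical.

```python
def sample_moment(xs, i):
    """i-th raw sample moment: (1/n) * sum of x_j ** i."""
    return sum(x ** i for x in xs) / len(xs)

def gamma_method_of_moments(xs):
    """Match the first two moments of Gamma(shape, scale) to the sample moments.

    E[X] = shape * scale and E[X^2] - E[X]^2 = shape * scale^2, so
    scale = variance / mean and shape = mean / scale.
    """
    m1 = sample_moment(xs, 1)
    m2 = sample_moment(xs, 2)
    variance = m2 - m1 ** 2
    scale = variance / m1
    shape = m1 / scale
    return shape, scale

# Tiny made-up sample: m1 = 2.5, m2 = 7.5, so variance = 1.25.
shape_hat, scale_hat = gamma_method_of_moments([1.0, 2.0, 3.0, 4.0])
print(shape_hat, scale_hat)
```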

129 Nonparametric Density Estimation
Assume very little about the form of f. Use mainly local information to estimate f at a point x.

130 Motivation
If f is smooth enough and we observe a point $X_i = x_i$, we assume f assigns some density not only at $x_i$ but also in a region around $x_i$. To estimate f from $X_1, X_2, \ldots, X_n \sim \text{iid } f$, we accumulate localized probability density contributions in regions around each $X_i$.

131 h-neighborhood
To estimate the density at a point x, consider a region of width dx = 2h centered at x, where h is a parameter to be set. The proportion of the observations that fall in the interval $\gamma = [x - h, x + h]$, divided by the width 2h, approximates the density at x. Take $p(\gamma) = \int_{x-h}^{x+h} f(t)\,dt \approx 2h f(x)$; then
$$\hat f(x) = \frac{1}{2hn} \sum_{i=1}^{n} 1_{\{|x - X_i| < h\}}$$
Here, $1_{\{A\}} = 1$ if A is true, otherwise 0.
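
A minimal sketch of the h-neighborhood estimator, assuming the form "count of observations within distance h of x, divided by 2hn". The function name and the sample values are illustrative.

```python
def neighborhood_density(x, data, h):
    """Fraction of observations with |x - X_i| < h, divided by the interval width 2h."""
    count = sum(1 for xi in data if abs(x - xi) < h)
    return count / (2.0 * h * len(data))

# Two of the four made-up observations fall within 0.5 of x = 0.6.
sample = [0.0, 0.5, 1.0, 3.0]
print(neighborhood_density(0.6, sample, h=0.5))
```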

132 Estimation
$N_\gamma$ is the number of sample points in the interval γ. $N_\gamma$ is a $\text{Bin}(n, p(\gamma))$ random variable, where $E[N_\gamma / n] = p(\gamma)$ and $\text{var}[N_\gamma / n] = p(\gamma)(1 - p(\gamma)) / n$. To obtain an accurate estimate, we need $h \to 0$ and $n \to \infty$, since $\text{var}[N_\gamma / n] \to 0$ when $n \to \infty$. Thus, we should require $nh \to \infty$ and $h \to 0$ as $n \to \infty$.

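133 Performance Measures
Integrated squared error (ISE): $ISE(h) = \int (\hat f(x) - f(x))^2\,dx$. Mean integrated squared error (MISE): $MISE(h) = E\{ISE(h)\}$. MISE(h) is the accumulation of the local mean squared error at each x: $MISE(h) = \int MSE_h(\hat f(x))\,dx$, where $MSE_h(\hat f(x)) = E\{(\hat f(x) - f(x))^2\}$.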

134 Kernel Functions
Recall $\hat f(x) = \frac{1}{2hn} \sum_{i=1}^{n} 1_{\{|x - X_i| < h\}}$. Every point in the h-neighborhood is weighted the same. Intuition: closer points should be weighted more. Idea: use a kernel function. A kernel function K() is a non-negative, real-valued, integrable function such that $\int K(u)\,du = 1$ and $K(-u) = K(u)$ for any u.

135 Kernel Density Estimation
Using a kernel function K, estimate f by
$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$$
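
The kernel density estimator can be sketched in a few lines. The standard normal kernel is used here as one common choice (an assumption; the slide does not fix K), and the function names and data are illustrative.

```python
import math

def normal_kernel(u):
    """Standard normal density, a symmetric kernel that integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=normal_kernel):
    """Kernel density estimate: (1 / (n h)) * sum of K((x - X_i) / h)."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

observations = [0.0, 1.0, 3.0]
step = 0.01
# A Riemann sum over a wide grid should be close to 1 for a valid density estimate.
area = sum(kde(-10.0 + step * i, observations, h=0.5) for i in range(2301)) * step
print(kde(1.0, observations, h=0.5), area)
```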

136 Bandwidth
Parameter h is called the (fixed) bandwidth. The bandwidth has a strong influence on the estimator. If h is too small, the estimator tends to assign probability density too locally near the observed data, resulting in a very wiggly estimated density function with many false modes. If h is too large, the estimator spreads probability density contributions too diffusely; averaging over neighborhoods that are too large smooths away important features of f. The bandwidth controls the trade-off between the bias and the variance of the estimator.
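
To make the effect of h concrete, the sketch below counts local maxima of a normal-kernel density estimate on a grid for a small and a very large bandwidth. The data, the grid, and the function names are assumptions chosen for illustration; a tiny h yields several modes, while a very large h yields a single one.

```python
import math

def kde(x, data, h):
    """Normal-kernel density estimate at x."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return total / (len(data) * h * math.sqrt(2.0 * math.pi))

def count_modes(data, h, lo, hi, steps=900):
    """Count grid points that are local maxima of the estimated density."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    f = [kde(x, data, h) for x in grid]
    return sum(1 for i in range(1, steps) if f[i - 1] < f[i] >= f[i + 1])

# Made-up sample with two loose clusters.
points = [0.0, 0.2, 0.5, 1.1, 1.3, 4.0, 4.4, 4.6, 5.2]
modes_small_h = count_modes(points, 0.05, lo=-2.0, hi=7.0)
modes_large_h = count_modes(points, 10.0, lo=-2.0, hi=7.0)
print(modes_small_h, modes_large_h)
```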

137 Example

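138 Roughness
Measure the roughness of a function g by $R(g) = \int g^2(z)\,dz$. Assume $R(K) < \infty$, $f'$ and $f''$ are bounded and continuous, and $R(f'') < \infty$. Recall $MSE_h(\hat f(x)) = \text{var}\{\hat f(x)\} + (\text{bias}\{\hat f(x)\})^2$. Thus,
$$MISE(h) = \int \text{var}\{\hat f(x)\}\,dx + \int (\text{bias}\{\hat f(x)\})^2\,dx$$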

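139 Computing the Bias
Using the Taylor series expansion $f(x - ht) = f(x) - ht f'(x) + \frac{h^2 t^2}{2} f''(x) + o(h^2)$ in
$$E\{\hat f(x)\} = \frac{1}{h} \int K\left(\frac{x - u}{h}\right) f(u)\,du = \int K(t) f(x - ht)\,dt$$
Since K is symmetric about 0, $\int t K(t)\,dt = 0$, so $E\{\hat f(x)\} = f(x) + \frac{h^2 \sigma_K^2}{2} f''(x) + o(h^2)$, where $\sigma_K^2 = \int t^2 K(t)\,dt$. Thus,
$$\int (\text{bias}\{\hat f(x)\})^2\,dx = \frac{h^4 \sigma_K^4}{4} R(f'') + o(h^4)$$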

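140 Computing the Variance
Using a similar strategy, $\text{var}\{\hat f(x)\} = \frac{1}{nh} f(x) R(K) + o\left(\frac{1}{nh}\right)$, so $\int \text{var}\{\hat f(x)\}\,dx = \frac{R(K)}{nh} + o\left(\frac{1}{nh}\right)$. Thus,
$$MISE(h) = AMISE(h) + o\left(\frac{1}{nh} + h^4\right)$$
where
$$AMISE(h) = \frac{R(K)}{nh} + \frac{h^4 \sigma_K^4 R(f'')}{4}$$
is the asymptotic mean integrated squared error.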

141 Optimal Bandwidth
Set h at an intermediate value that avoids both excessive bias and excessive variability. Theoretically, the minimizer of AMISE(h),
$$h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$$
is the optimal bandwidth. It is not practically useful, since it depends on f, which is to be estimated. Turn to some heuristic methods.
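
A quick numerical check of the closed form, assuming AMISE(h) = R(K)/(nh) + h⁴σ_K⁴R(f'')/4. For the sake of the example only, f is taken to be the standard normal density, for which R(f'') = 3/(8√π), and K is the normal kernel with R(K) = 1/(2√π) and σ_K = 1. In practice R(f'') is unknown, which is the point of the slide.

```python
import math

def amise(h, n, r_k, sigma_k, r_f2):
    """Asymptotic MISE: variance term R(K)/(n h) plus squared-bias term."""
    return r_k / (n * h) + (h ** 4) * (sigma_k ** 4) * r_f2 / 4.0

def optimal_bandwidth(n, r_k, sigma_k, r_f2):
    """Closed-form minimizer of AMISE(h)."""
    return (r_k / (n * sigma_k ** 4 * r_f2)) ** 0.2

n = 200
r_k = 1.0 / (2.0 * math.sqrt(math.pi))    # roughness of the normal kernel
sigma_k = 1.0                             # standard deviation of the normal kernel
r_f2 = 3.0 / (8.0 * math.sqrt(math.pi))   # R(f'') if f were standard normal (assumption)

h_star = optimal_bandwidth(n, r_k, sigma_k, r_f2)
print(h_star, amise(h_star, n, r_k, sigma_k, r_f2))
```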

142 Cross-Validation Ideas
Relate h to some quantified quality measure Q(h) of $\hat f$ as an estimator of f. Optimize Q(h) to find the optimal h. Problem: we use the observed data to calculate $\hat f$ and use the same data to evaluate the quality of $\hat f$, which may lead to overfitting. Remedy: use some of the data to learn a model and the rest of the data to evaluate it.

143 Cross-Validation
To evaluate the quality of $\hat f$ at the i-th data point, use all the data except for the i-th data point to train the model. Denote the estimated density at $X_i$ using a kernel density estimator with all the observations except for $X_i$ by
$$\hat f_{-i}(X_i) = \frac{1}{h(n-1)} \sum_{j \neq i} K\left(\frac{X_i - X_j}{h}\right)$$
Set Q(h) to a function of $\hat f_{-1}(X_1), \ldots, \hat f_{-n}(X_n)$. The bandwidth estimated this way can be highly sensitive to sampling variability.

144 Using Pseudo-Likelihood
Set Q(h) to the pseudo-likelihood (PL): $PL(h) = \prod_{i=1}^{n} \hat f_{-i}(X_i)$. Optimize PL(h) to find the optimal h. The approach is simple and intuitively appealing, but the estimate produced tends to be too wiggly and sensitive to outliers.
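
A sketch of bandwidth selection by pseudo-likelihood, assuming the leave-one-out estimate uses the normal kernel and that the search is a simple grid over candidate bandwidths. The grid, the data, and the function names are illustrative choices. The logarithm of PL(h) is maximized instead of PL(h) itself to avoid numerical underflow of the product.

```python
import math

def normal_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def loo_density(i, data, h):
    """Density at data[i] estimated from all observations except data[i]."""
    n = len(data)
    total = sum(normal_kernel((data[i] - data[j]) / h) for j in range(n) if j != i)
    return total / ((n - 1) * h)

def log_pseudo_likelihood(data, h):
    """log PL(h) = sum over i of log f_hat_{-i}(X_i)."""
    return sum(math.log(loo_density(i, data, h)) for i in range(len(data)))

# Made-up sample and candidate grid.
sample = [0.0, 0.2, 0.5, 1.1, 1.3, 4.0, 4.4, 4.6, 5.2]
candidates = [0.1 * k for k in range(1, 31)]
best_h = max(candidates, key=lambda h: log_pseudo_likelihood(sample, h))
print(best_h)
```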

145 Unbiased Cross-Validation Criterion
Rewrite the integrated squared error as
$$ISE(h) = R(\hat f) - 2 E\{\hat f(X)\} + R(f)$$
The last term is constant. The second term can be estimated by $-\frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i)$. Thus, minimizing
$$UCV(h) = R(\hat f) - \frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i)$$
with respect to h can find a good h. It is called the unbiased cross-validation criterion because $E\{UCV(h) + R(f)\} = MISE(h)$. It is also known as least squares cross-validation, because choosing h to minimize UCV(h) minimizes the integrated squared error between $\hat f$ and f with respect to h.

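146 Using a Normal Kernel
If analytic evaluation of $R(\hat f)$ is infeasible, use a different kernel that permits an analytic simplification. If a normal kernel ϕ is used,
$$UCV(h) = \frac{R(\phi)}{nh} + \frac{1}{n(n-1)h} \sum_{i=1}^{n} \sum_{j \neq i} \left[ \frac{1}{(8\pi)^{1/4}} \phi^{1/2}\left(\frac{X_i - X_j}{h}\right) - 2\phi\left(\frac{X_i - X_j}{h}\right) \right]$$
Drawbacks: slow convergence to the optimum, and strong dependence on the observed data. When applied to different data sets drawn from the same distribution, UCV may yield very different answers.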
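
The sketch below evaluates a closed-form UCV criterion for the normal kernel. It assumes the form UCV(h) = R(ϕ)/(nh) + (1/(n(n-1)h)) Σ_i Σ_{j≠i} [(8π)^(-1/4) ϕ^(1/2)(d_ij) - 2ϕ(d_ij)] with d_ij = (X_i - X_j)/h, as given in standard textbook treatments; the function names, data, and grid search are illustrative.

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

R_PHI = 1.0 / (2.0 * math.sqrt(math.pi))  # roughness of the standard normal density

def ucv_normal(data, h):
    """Closed-form UCV(h) for a normal kernel (assumed textbook form)."""
    n = len(data)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d = (data[i] - data[j]) / h
                total += (8.0 * math.pi) ** -0.25 * math.sqrt(phi(d)) - 2.0 * phi(d)
    return R_PHI / (n * h) + total / (n * (n - 1) * h)

# Made-up sample; pick the candidate bandwidth with the smallest UCV value.
sample = [0.0, 0.2, 0.5, 1.1, 1.3, 4.0, 4.4, 4.6, 5.2]
candidates = [0.1 * k for k in range(1, 31)]
best_h = min(candidates, key=lambda h: ucv_normal(sample, h))
print(best_h)
```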

147 Plug-in Methods
Apply a pilot bandwidth to estimate one or more important features of f. The bandwidth for estimating f itself is then estimated at a second stage, using a criterion that depends on the estimated features. Theoretically,
$$h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$$
is the optimal bandwidth. Key: estimate $R(f'')$. Plug-in methods are very effective in many applications.

148 Silverman's Rule of Thumb
Replace f by a normal density with variance set to match the sample variance. Equivalently, estimate $R(f'')$ by $R(\phi'') / \hat\sigma^5$, where ϕ is the standard normal density function and $\hat\sigma^2$ is the sample variance.
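
As a sketch: if the kernel is also normal (an assumption, with R(K) = 1/(2√π) and σ_K = 1) and R(ϕ'') = 3/(8√π), then plugging R(ϕ'')/σ̂⁵ into the optimal-bandwidth formula gives h = (4/(3n))^(1/5) σ̂. The code below computes both forms so the equivalence can be checked; the names and data are illustrative, and `statistics.stdev` uses the n - 1 denominator.

```python
import math
import statistics

def plug_in_bandwidth(n, r_k, sigma_k, r_f2_hat):
    """h = (R(K) / (n * sigma_K^4 * R_hat(f'')))^(1/5)."""
    return (r_k / (n * sigma_k ** 4 * r_f2_hat)) ** 0.2

def silverman_bandwidth(data):
    """Rule of thumb for a normal kernel: h = (4 / (3 n))^(1/5) * sigma_hat."""
    return (4.0 / (3.0 * len(data))) ** 0.2 * statistics.stdev(data)

sample = [0.0, 0.2, 0.5, 1.1, 1.3, 4.0, 4.4, 4.6, 5.2]
sigma_hat = statistics.stdev(sample)
r_phi = 1.0 / (2.0 * math.sqrt(math.pi))        # R(K) for the normal kernel
r_phi2 = 3.0 / (8.0 * math.sqrt(math.pi))       # R(phi'')
h_general = plug_in_bandwidth(len(sample), r_phi, 1.0, r_phi2 / sigma_hat ** 5)
h_rule = silverman_bandwidth(sample)
print(h_general, h_rule)
```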

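149 Empirical Estimation of R(f'')
Empirical estimation of $R(f'')$ in $h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$. The kernel-based estimator is
$$\hat f''(x) = \frac{d^2}{dx^2} \left\{ \frac{1}{n h_0} \sum_{i=1}^{n} L\left(\frac{x - X_i}{h_0}\right) \right\} = \frac{1}{n h_0^3} \sum_{i=1}^{n} L''\left(\frac{x - X_i}{h_0}\right)$$
where $h_0$ is the bandwidth and L is a sufficiently differentiable kernel used to estimate $f''$. The best bandwidth for estimating f differs from the best bandwidth for estimating $f''$ or $R(f'')$. A larger bandwidth is required for estimating $f''$, that is, $h_0 > h$.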

150 The Sheather-Jones Method
A two-stage process: a simple rule of thumb is used to calculate the bandwidth $h_0$; then $h_0$ is used to estimate $R(f'')$. Compute h using
$$h = \left( \frac{R(K)}{n \sigma_K^4 \hat R(f'')} \right)^{1/5}$$

151 An Implementation
For univariate kernel density estimation with pilot kernel L = ϕ, the Sheather-Jones bandwidth is the value of h that solves the equation
$$\left( \frac{R(K)}{n \sigma_K^4 \hat R_{\hat\alpha(h)}(f'')} \right)^{1/5} - h = 0$$

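152 An Implementation (Cont'd)
Here,
$$\hat R_{\alpha}(f'') = \frac{1}{n(n-1)\alpha^5} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi^{(4)}\left(\frac{X_i - X_j}{\alpha}\right), \quad \hat\alpha(h) = \left( \frac{6\sqrt{2}\, h^5 \hat R_a(f'')}{\hat R_b(f''')} \right)^{1/7}$$
$$\hat R_b(f''') = -\frac{1}{n(n-1)b^7} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi^{(6)}\left(\frac{X_i - X_j}{b}\right), \quad a = \frac{0.920\,(IQR)}{n^{1/7}}, \quad b = \frac{0.912\,(IQR)}{n^{1/9}}$$
Solve for h using, for example, Newton's method.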

153 Interquartile Range (IQR)
A measure of statistical dispersion, also known as the midspread or middle fifty. It is the difference between the upper and lower quartiles: $IQR = Q_3 - Q_1$ (in the slide's example, IQR = 10).
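
A short sketch of computing the IQR with the Python standard library. Quartile conventions differ between textbooks and software; the `method="inclusive"` setting of `statistics.quantiles` is one choice among several and is an assumption here, as are the function name and data.

```python
import statistics

def interquartile_range(data):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(interquartile_range(values))
```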

154 Maximal Smoothing Principle
The bandwidth should be selected to discourage false modes, producing an estimate that shows modes only where the data indisputably require them. Replace $R(f'')$ with the most conservative (i.e., smallest) possible value. Consider all h that would minimize AMISE(h) over the possible densities f, and select the largest h. Equivalently, the right-hand side of $h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$ should be maximized with respect to f.

155 Implementation
Set
$$h = 3 \hat\sigma \left( \frac{R(K)}{35\, n\, \sigma_K^4} \right)^{1/5}$$
where $\hat\sigma$ is the sample standard deviation.
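
A sketch of the maximal smoothing bandwidth, assuming the form h = 3σ̂ (R(K)/(35 n σ_K⁴))^(1/5) and a normal kernel by default (R(K) = 1/(2√π), σ_K = 1). For comparison it also computes the normal-reference rule of thumb (4/(3n))^(1/5) σ̂; under these assumptions the maximal smoothing bandwidth is about 8% larger, consistent with its deliberately conservative design. Names and data are illustrative.

```python
import math
import statistics

R_PHI = 1.0 / (2.0 * math.sqrt(math.pi))  # roughness of the normal kernel

def maximal_smoothing_bandwidth(data, r_k=R_PHI, sigma_k=1.0):
    """h = 3 * sigma_hat * (R(K) / (35 * n * sigma_K^4))^(1/5)."""
    n = len(data)
    return 3.0 * statistics.stdev(data) * (r_k / (35.0 * n * sigma_k ** 4)) ** 0.2

def normal_reference_bandwidth(data):
    """Rule-of-thumb bandwidth for a normal kernel, shown for comparison."""
    return (4.0 / (3.0 * len(data))) ** 0.2 * statistics.stdev(data)

sample = [0.0, 0.2, 0.5, 1.1, 1.3, 4.0, 4.4, 4.6, 5.2]
h_max = maximal_smoothing_bandwidth(sample)
h_ref = normal_reference_bandwidth(sample)
print(h_max, h_ref, h_max / h_ref)
```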
