Computational Statistics. Jian Pei School of Computing Science Simon Fraser University
2 BASIC OPTIMIZATION METHODS J. Pei: Computational Statistics 2
3 Why Optimization? In statistical inference, we have to conduct maximum likelihood estimation Most functions cannot be analytically optimized Example: how to maximize? Set The equation does not have an analytic solution J. Pei: Computational Statistics 3
4 Problem Formulation: Let $l$ be the log-likelihood function. If $\hat{\theta}$ is the maximum likelihood estimate (MLE), then it is a solution to the score equation $l'(\theta) = 0$, where $l'(\theta) = \left(\frac{dl(\theta)}{d\theta_1}, \ldots, \frac{dl(\theta)}{d\theta_n}\right)^T$.
5 Example J. Pei: Computational Statistics 5
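6 The Bisection Method: Idea. If $g'$ is continuous on $[a_0, b_0]$ and $g'(a_0)\,g'(b_0) \le 0$, then there exists at least one $x^* \in [a_0, b_0]$ such that $g'(x^*) = 0$. Systematically shrink the interval from $[a_0, b_0]$ to $[a_1, b_1]$ to $[a_2, b_2]$ and so on, always keeping a root inside.
7 The Bisection Method: Let $x^{(0)} = (a_0 + b_0)/2$ be the starting value; then $[a_{t+1}, b_{t+1}] = [a_t, x^{(t)}]$ if $g'(a_t)\,g'(x^{(t)}) \le 0$, and $[a_{t+1}, b_{t+1}] = [x^{(t)}, b_t]$ if $g'(a_t)\,g'(x^{(t)}) > 0$, where $x^{(t+1)} = (a_{t+1} + b_{t+1})/2$.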
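A minimal Python sketch of the bisection update may help here. The objective g(x) = log(x)/(1 + x), the bracket [1, 5], and the names `bisect` and `gprime` are illustrative assumptions, not taken from the slides.

```python
import math

def bisect(gprime, a, b, tol=1e-8, max_iter=200):
    """Find a root of gprime in [a, b]; requires gprime(a) * gprime(b) <= 0."""
    if gprime(a) * gprime(b) > 0:
        raise ValueError("root is not bracketed by [a, b]")
    for _ in range(max_iter):
        x = (a + b) / 2.0                 # midpoint x^(t)
        if gprime(a) * gprime(x) <= 0:    # root lies in [a, x]
            b = x
        else:                             # root lies in [x, b]
            a = x
        if b - a < tol:                   # interval is short enough
            break
    return (a + b) / 2.0

def gprime(x):
    # Derivative of the assumed objective g(x) = log(x) / (1 + x).
    return (1 + 1 / x - math.log(x)) / (1 + x) ** 2

root = bisect(gprime, 1.0, 5.0)
print(root)
```

Because the interval halves at every step, the number of iterations needed depends only on the initial width and the tolerance, not on the shape of g.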
8 Example J. Pei: Computational Statistics 8
9 Stopping Rule: Stop if the procedure appears to have achieved satisfactory convergence, or if it appears unlikely to do so soon. The absolute convergence criterion: $|x^{(t+1)} - x^{(t)}| < \epsilon$. For bisection, $b_t - a_t = 2^{-t}(b_0 - a_0)$. A true error tolerance of $|x^{(t)} - x^*| < \delta$ is achieved when $2^{-(t+1)}(b_0 - a_0) < \delta$, that is, when $t > \log_2\{(b_0 - a_0)/\delta\} - 1$ holds.
10 Relative Convergence Criterion: Criterion: $\frac{|x^{(t+1)} - x^{(t)}|}{|x^{(t)}|} < \epsilon$. May not work well when $x^{(t)}$ (and hence $x^*$) is close to 0. Alternatively, $\frac{|x^{(t+1)} - x^{(t)}|}{|x^{(t)}| + \epsilon} < \epsilon$. The bisection method is in theory guaranteed to converge to a root.
11 Stopping in Failure to Converge: The MLE may not be unique. Convergence may not happen due to limited computational precision. Stopping rules that flag a failure to converge: stop after N iterations; stop when some convergence measure fails to decrease, or cycles, over several iterations.
12 Bracketing Methods: Bound a root within a sequence of nested intervals of decreasing length. Bisection is a bracketing method. Slow! If $g'$ is continuous on $[a_0, b_0]$ and changes sign across the interval, a root can be found. Bracketing methods do not rely on the existence, behavior, or ease of deriving $g''$.
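13 Newton-Raphson Iteration: Assumption: $g'$ is continuously differentiable and $g''(x^*) \ne 0$. At iteration t, approximate $g'(x^*)$ by the linear Taylor series expansion $0 = g'(x^*) \approx g'(x^{(t)}) + (x^* - x^{(t)})\,g''(x^{(t)})$. That is, $x^* \approx x^{(t)} - \frac{g'(x^{(t)})}{g''(x^{(t)})}$. The updating equation: $x^{(t+1)} = x^{(t)} + h^{(t)}$, where $h^{(t)} = -\frac{g'(x^{(t)})}{g''(x^{(t)})}$.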
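The Newton-Raphson update can be sketched in a few lines of Python. As an assumption for illustration only, the objective is g(x) = log(x)/(1 + x) with starting value 3.0; the function names `newton`, `gprime`, and `gsecond` are hypothetical.

```python
import math

def gprime(x):
    # First derivative of the assumed objective g(x) = log(x) / (1 + x).
    return (1 + 1 / x - math.log(x)) / (1 + x) ** 2

def gsecond(x):
    # Second derivative of the same assumed objective.
    return -1 / (x ** 2 * (1 + x)) - 2 * (1 + 1 / x - math.log(x)) / (1 + x) ** 3

def newton(gprime, gsecond, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x + h with h = -g'(x) / g''(x) until |h| < tol."""
    x = x0
    for _ in range(max_iter):
        h = -gprime(x) / gsecond(x)
        x += h
        if abs(h) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")

x_star = newton(gprime, gsecond, 3.0)
print(x_star)
```

Unlike bisection, this iteration needs the second derivative and a reasonable starting value; a poor start can send the iterates away from the root.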
14 Example J. Pei: Computational Statistics 14
15 Newton's Method May Not Converge
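16 An Interesting Situation: Suppose $g'''$ is continuous and $g''(x^*) \ne 0$. Then there exists a neighborhood of $x^*$ within which $g''(x) \ne 0$ for all x. Define $\epsilon^{(t)} = x^{(t)} - x^*$. Using a Taylor expansion, $0 = g'(x^*) = g'(x^{(t)}) + (x^* - x^{(t)})\,g''(x^{(t)}) + \frac{1}{2}(x^* - x^{(t)})^2 g'''(q)$ for some q between $x^{(t)}$ and $x^*$. That is, with the Newton increment $h^{(t)} = -g'(x^{(t)})/g''(x^{(t)})$, $x^{(t)} + h^{(t)} - x^* = (x^* - x^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}$. That is, $\epsilon^{(t+1)} = (\epsilon^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}$.
17 An Interesting Situation (cont'd): Consider a neighborhood of $x^*$, $N_\delta(x^*) = [x^* - \delta, x^* + \delta]$ for δ > 0. Define $c(\delta) = \max_{x_1, x_2 \in N_\delta(x^*)} \left|\frac{g'''(x_1)}{2 g''(x_2)}\right|$. Then $c(\delta) \to \left|\frac{g'''(x^*)}{2 g''(x^*)}\right|$ as $\delta \to 0$, i.e., $\delta\,c(\delta) \to 0$. Thus, we can choose a δ such that $\delta\,c(\delta) < 1$.
18 An Interesting Situation (cont'd): If $x^{(t)}$ lies in the neighborhood $[x^* - \delta, x^* + \delta]$, then from $\epsilon^{(t+1)} = (\epsilon^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}$ we have $|c(\delta)\,\epsilon^{(t+1)}| \le (c(\delta)\,\epsilon^{(t)})^2$. Assuming a starting point satisfying $|\epsilon^{(0)}| = |x^{(0)} - x^*| \le \delta$, it follows that $|\epsilon^{(t)}| \le \frac{(c(\delta)\,\delta)^{2^t}}{c(\delta)} \to 0$ as $t \to \infty$.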
19 Newton's Method Converges: If $g'''$ is continuous and $x^*$ is a simple root of $g'$, then there exists a neighborhood of $x^*$ for which Newton's method converges to $x^*$ when started from any $x^{(0)}$ in that neighborhood. A root $x^*$ of $g'$ is simple if it has multiplicity one, i.e., $g'(x^*) = 0$ and $g''(x^*) \ne 0$. Other conditions under which Newton's method converges also exist.
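20 Convergence Order: Measures how fast a root-finding method is. A method has convergence of order β > 0 if it converges, $\lim_{t\to\infty} \epsilon^{(t)} = 0$, and $\lim_{t\to\infty} \frac{|\epsilon^{(t+1)}|}{|\epsilon^{(t)}|^\beta} = c$, where $\epsilon^{(t)} = x^{(t)} - x^*$. Convergence speed: $c \ne 0$ is a constant.
21 Convergence Order of Newton's Method: Recall $\epsilon^{(t+1)} = (\epsilon^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}$. Then, $\frac{\epsilon^{(t+1)}}{(\epsilon^{(t)})^2} = \frac{g'''(q)}{2 g''(x^{(t)})}$. If Newton's method converges, then $q \to x^*$ and $\lim_{t\to\infty} \frac{|\epsilon^{(t+1)}|}{|\epsilon^{(t)}|^2} = \left|\frac{g'''(x^*)}{2 g''(x^*)}\right| = c$. Newton's method has quadratic convergence, that is, β = 2.
22 Fisher Scoring: Replace $-l''(\theta^{(t)})$ in the Newton update with $I(\theta^{(t)})$. $I(\theta)$ is the Fisher information, and can be approximated by $-l''(\theta)$; $I(\theta^{(t)})$ is the expected Fisher information evaluated at $\theta^{(t)}$. The updating equation is $\theta^{(t+1)} = \theta^{(t)} + l'(\theta^{(t)})\,I(\theta^{(t)})^{-1}$.
23 Comparison: Fisher scoring and Newton's method have the same asymptotic properties. Fisher scoring works better in the beginning to make rapid improvements. Newton's method works better for refinement near the end.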
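24 Secant Method: Newton's method becomes inconvenient if $g''(x^{(t)})$ is hard to calculate. Use the discrete-difference approximation $g''(x^{(t)}) \approx \frac{g'(x^{(t)}) - g'(x^{(t-1)})}{x^{(t)} - x^{(t-1)}}$. The updating equation becomes, for $t \ge 1$, $x^{(t+1)} = x^{(t)} - g'(x^{(t)})\,\frac{x^{(t)} - x^{(t-1)}}{g'(x^{(t)}) - g'(x^{(t-1)})}$.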
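The secant update only needs first-derivative evaluations, as the following Python sketch illustrates. The objective g(x) = log(x)/(1 + x) and the two starting values are assumptions chosen for illustration; `secant` and `gprime` are hypothetical names.

```python
import math

def gprime(x):
    # Derivative of the assumed objective g(x) = log(x) / (1 + x).
    return (1 + 1 / x - math.log(x)) / (1 + x) ** 2

def secant(gprime, x0, x1, tol=1e-10, max_iter=100):
    """Secant iteration: replace g'' by a finite difference of g'."""
    for _ in range(max_iter):
        f0, f1 = gprime(x0), gprime(x1)
        if f1 == f0:
            raise ZeroDivisionError("flat secant; cannot update")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        x0, x1 = x1, x2
        if abs(x1 - x0) < tol:
            return x1
    raise RuntimeError("secant method did not converge")

x_sec = secant(gprime, 3.0, 4.0)
print(x_sec)
```

Two starting values are required because the finite difference needs the previous iterate.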
25 Convergence Order: The secant method has convergence of order $\beta = (1 + \sqrt{5})/2 \approx 1.62$. This is a slower order of convergence than Newton's method (β = 2), although it is still superlinear.
26 Example J. Pei: Computational Statistics 26
27 THE EM METHOD J. Pei: Computational Statistics 27
28 A Frequentist Setting: Conceive of observed data generated from random variables X, along with some missing or unobserved data from random variables Z; Z is often called latent. Envision complete data generated from Y = (X, Z). Given observed data x, we wish to maximize a likelihood $L(\theta|x)$, which is often hard to work with. It is easier to deal with the densities of $Y|\theta$ and $Z|(x, \theta)$. EM works with these easier densities instead of with $L(\theta|x)$ directly.
29 Latent Z: Conceptually, Z may be viewed as having been removed from the complete Y through the application of some many-to-fewer mapping X = M(Y). Let $f_X(x|\theta)$ be the density of the observed data and $f_Y(y|\theta)$ be the density of the complete data. A marginalization model: observe X having density $f_X(x|\theta) = \int_{\{y : M(y) = x\}} f_Y(y|\theta)\,dy$. Conditional density of the missing data: $f_{Z|X}(z|x, \theta) = f_Y(y|\theta) / f_X(x|\theta)$.
30 A Bayesian Setting: Estimate the mode of a posterior distribution $f(\theta|x)$. Consider unobserved random variables ψ in addition to the parameters of interest θ. Missing data may not be truly missing: they can be a conceptual tool to simplify problems.
31 Marginalization: $L(\theta|x)$ is a marginalization of the complete-data likelihood $L(\theta|y) = L(\theta|x, z)$. Alternatively, assume missing parameters ψ of no interest. There is no difference between the two views under the Bayesian paradigm.
32 The EM Algorithm: Objectives. Iteratively seek to maximize $L(\theta|x)$ with respect to θ. Let $\theta^{(t)}$ be the estimated maximizer at iteration t, for t = 0, 1, .... Let $Q(\theta|\theta^{(t)})$ be the expectation of the joint log likelihood for the complete data, conditional on the observed data X = x, that is, $Q(\theta|\theta^{(t)}) = E\{\log L(\theta|Y) \mid x, \theta^{(t)}\}$.
33 The EM Algorithm: Steps. The E step: compute $Q(\theta|\theta^{(t)})$. The M step: maximize $Q(\theta|\theta^{(t)})$ with respect to θ and set $\theta^{(t+1)}$ to the maximizer of Q. Return to the E step until a stopping criterion is met.
34 Fuzzy Clustering: Each point $x_i$ has a probability $w_{ij}$ of belonging to cluster $C_j$. Requirements: for each point $x_i$, $\sum_{j=1}^{k} w_{ij} = 1$; for each cluster $C_j$, $0 < \sum_{i=1}^{m} w_{ij} < m$.
35 Fuzzy C-Means (FCM): Select an initial fuzzy pseudo-partition, i.e., assign values to all the $w_{ij}$. Repeat: compute the centroid of each cluster using the fuzzy pseudo-partition; then recompute the fuzzy pseudo-partition, i.e., the $w_{ij}$. Until the centroids do not change (or the change is below some threshold).
36 Critical Details: Optimization on the sum of the squared error (SSE): $SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k}\sum_{i=1}^{m} w_{ij}^p\,\mathrm{dist}(x_i, c_j)^2$. Computing centroids: $c_j = \sum_{i=1}^{m} w_{ij}^p x_i \Big/ \sum_{i=1}^{m} w_{ij}^p$. Updating the fuzzy pseudo-partition: $w_{ij} = \left(1/\mathrm{dist}(x_i, c_j)^2\right)^{\frac{1}{p-1}} \Big/ \sum_{q=1}^{k} \left(1/\mathrm{dist}(x_i, c_q)^2\right)^{\frac{1}{p-1}}$. When p = 2: $w_{ij} = \frac{1/\mathrm{dist}(x_i, c_j)^2}{\sum_{q=1}^{k} 1/\mathrm{dist}(x_i, c_q)^2}$.
37 Choice of p: When p is close to 1, FCM behaves like traditional k-means. When p is larger, the cluster centroids approach the global centroid of all data points. The partition becomes fuzzier as p increases.
38 Effectiveness J. Pei: Computational Statistics 38
39 Mixture Models A cluster can be modeled as a probability distribution Practically, assume a distribution can be approximated well using multivariate normal distribution Multiple clusters are a mixture of different probability distributions A data set is a set of observations from a mixture of models J. Pei: Computational Statistics 39
40 Object Probability: Suppose there are k clusters and a set X of m objects. Let the j-th cluster have parameter $\theta_j = (\mu_j, \sigma_j)$. The probability that a point is in the j-th cluster is $w_j$, with $w_1 + \cdots + w_k = 1$. The probability of an object x is $prob(x|\Theta) = \sum_{j=1}^{k} w_j\,p_j(x|\theta_j)$, and for the whole set $prob(X|\Theta) = \prod_{i=1}^{m} prob(x_i|\Theta) = \prod_{i=1}^{m}\sum_{j=1}^{k} w_j\,p_j(x_i|\theta_j)$.
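41 Example: For a single normal component, $prob(x_i|\Theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$. With $\theta_1 = (-4, 2)$ and $\theta_2 = (4, 2)$, $prob(x|\Theta) = w_1 \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + w_2 \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}$.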
42 Maximum Likelihood Estimation: Maximum likelihood principle: if we know a set of objects are from one distribution but do not know the parameter, we can choose the parameter maximizing the probability of the data. Maximize $prob(X|\Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$. Equivalently, maximize $\log prob(X|\Theta) = -\sum_{i=1}^{m} \frac{(x_i-\mu)^2}{2\sigma^2} - 0.5\,m\log 2\pi - m\log\sigma$.
43 EM Algorithm: Expectation Maximization algorithm. Select an initial set of model parameters. Repeat: Expectation step: for each object $x_i$, calculate the probability that it belongs to each distribution j, i.e., $prob(\text{distribution } j \mid x_i, \Theta)$; Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood. Until the parameters are stable.
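The E and M steps for a mixture of univariate normals can be sketched as follows. This is a minimal illustration under stated assumptions: one-dimensional data, k normal components, quantile-based initial means, and a fixed iteration count instead of a convergence test; `em_gmm_1d` is a hypothetical name.

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=200):
    """EM for a k-component 1-D normal mixture; returns (w, mu, sigma)."""
    w = np.full(k, 1.0 / k)                          # mixing weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)    # spread-out initial means
    sigma = np.full(k, x.std())
    for _ in range(n_iter):
        # E step: r[i, j] = probability that x_i came from component j.
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)
        # M step: weighted maximum likelihood updates.
        nj = r.sum(axis=0)
        w = nj / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nj
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
    return w, mu, sigma

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-4, 2, 500), rng.normal(4, 2, 500)])
w_hat, mu_hat, sigma_hat = em_gmm_1d(data)
print(w_hat, mu_hat, sigma_hat)
```

The synthetic data mimic the two-component example with means -4 and 4 and standard deviation 2; on real data the result depends on the initial parameters.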
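44 Convergence: The log of the observed-data density can be rewritten as $\log f_X(x|\theta) = \log f_Y(y|\theta) - \log f_{Z|X}(z|x,\theta)$. Thus, $E\{\log f_X(x|\theta) \mid x, \theta^{(t)}\} = E\{\log f_Y(Y|\theta) \mid x, \theta^{(t)}\} - E\{\log f_{Z|X}(Z|x,\theta) \mid x, \theta^{(t)}\}$, where the expectations are taken with respect to the distribution of $Z|(x, \theta^{(t)})$. Define $H(\theta|\theta^{(t)}) = E\{\log f_{Z|X}(Z|x,\theta) \mid x, \theta^{(t)}\}$. Then, $\log f_X(x|\theta) = Q(\theta|\theta^{(t)}) - H(\theta|\theta^{(t)})$.
45 $H(\theta|\theta^{(t)})$: Maximized by $\theta = \theta^{(t)}$: for any $\theta \ne \theta^{(t)}$, $H(\theta|\theta^{(t)}) \le H(\theta^{(t)}|\theta^{(t)})$. If $\theta^{(t+1)}$ is chosen to maximize $Q(\theta|\theta^{(t)})$ with respect to θ, then $\log f_X(x|\theta^{(t+1)}) - \log f_X(x|\theta^{(t)}) \ge 0$, since Q increases and H decreases, with strict inequality when $Q(\theta^{(t+1)}|\theta^{(t)}) > Q(\theta^{(t)}|\theta^{(t)})$.
46 Generalized EM Algorithm: Standard EM: choose $\theta^{(t+1)}$ at each iteration to maximize $Q(\theta|\theta^{(t)})$ with respect to θ. Generalized EM (GEM): choose any $\theta^{(t+1)}$ such that $Q(\theta^{(t+1)}|\theta^{(t)}) > Q(\theta^{(t)}|\theta^{(t)})$. Both EM and GEM converge by the same argument: no iteration decreases the observed-data log likelihood.
47 Order of Convergence: The EM algorithm defines a mapping $\theta^{(t+1)} = \Psi(\theta^{(t)})$. When EM converges, it converges to a fixed point $\hat\theta$ of the mapping. The global rate of EM convergence is $\rho = \lim_{t\to\infty} \frac{\|\theta^{(t+1)} - \hat\theta\|}{\|\theta^{(t)} - \hat\theta\|}$. Let $\Psi'(\theta)$ be the Jacobian matrix whose (i, j)-th element is $d\Psi_i(\theta)/d\theta_j$. Then, ρ equals the largest eigenvalue of $\Psi'(\hat\theta)$ when $-l''(\hat\theta|x)$ is positive definite.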
48 Rationale: Jacobian matrix $J = \left[\frac{\partial F_i}{\partial x_j}\right]$, for i = 1, ..., m and j = 1, ..., n, where $F: R^n \to R^m$. If the function F is differentiable at a point x, then the Jacobian matrix defines a linear map $R^n \to R^m$, which is the best linear approximation of F near the point x.
49 Implications Linear convergence EM suffers slower convergence when the proportion of missing information is larger Ease of implementation J. Pei: Computational Statistics 49
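50 Hessian Matrix: For a function $f: R^n \to R$ such that all second partial derivatives of f exist and are continuous over the domain of the function, the Hessian matrix (or simply Hessian) is $H(f)(x) = \left[\frac{\partial^2 f}{\partial x_i\,\partial x_j}\right]_{i,j=1}^{n}$. Using Taylor expansion, $f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2}\,\Delta x^T H(f)(x)\,\Delta x$.
51 Missing-Information Principle: Recall $\log f_X(x|\theta) = Q(\theta|\theta^{(t)}) - H(\theta|\theta^{(t)})$. Take second partial derivatives and negate both sides: $-l''(\theta|x) = -Q''(\theta|\omega)\big|_{\omega=\theta} + H''(\theta|\omega)\big|_{\omega=\theta}$. Rewrite to $\hat{i}_X(\theta) = \hat{i}_Y(\theta) - \hat{i}_{Z|X}(\theta)$, the missing-information principle. $\hat{i}_X(\theta) = -l''(\theta|x)$: the observed information; $\hat{i}_Y(\theta) = -Q''(\theta|\omega)\big|_{\omega=\theta}$: the complete information; $\hat{i}_{Z|X}(\theta) = -H''(\theta|\omega)\big|_{\omega=\theta}$: the missing information.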
52 Estimating Covariance The covariance matrix for is Where the variance is taken with respect to f Z X J. Pei: Computational Statistics 52
53 MONTE CARLO METHODS J. Pei: Computational Statistics 53
54 Estimation of a Definite Integral: Motivation example: estimate $\int_D f(x)\,dx$. Situations where Monte Carlo is not needed: the integral can be evaluated in closed form; D is of low dimensionality (e.g., 1-2). Decomposition: if f(x) can be rewritten as f(x) = g(x) p(x) such that p(x) can be treated as a probability density function, that is, $p(x) \ge 0$ and $\int_D p(x)\,dx = 1$, then $\int_D f(x)\,dx = E[g(X)]$ for X with density p.
55 Estimation: How to estimate $E[g(X)] = \int_D g(x)\,p(x)\,dx$? Draw a random sample $\{x_1, x_2, \ldots, x_m\}$ from the distribution with probability density p; the estimate is $\frac{1}{m}\sum_{i=1}^{m} g(x_i)$.
56 Monte Carlo Method (PDF decomposition) Decompose a function of interest to include a probability density function as a factor Identify an expected value Use a sample to estimate the expected value The PDF decomposition may not be unique Typically used when dimensionality is high J. Pei: Computational Statistics 56
57 Example Uniform Distribution In the [0, 1] x [0, 1] square, what is the average distance between two points? In general, we can answer this question for any finite region using any distance measure Compute Draw a random sample {(x 1, y 1 ), (x 2, y 2 ),, (x m, y m )}, an estimate is J. Pei: Computational Statistics 57
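The unit-square question can be sketched as a Monte Carlo estimate in Python. As an assumption, each draw consists of two independent uniform points and the Euclidean distance is used; `mean_distance_unit_square` is a hypothetical name.

```python
import numpy as np

def mean_distance_unit_square(m, seed=0):
    """Monte Carlo estimate of E[distance] between two uniform points in [0,1]^2."""
    rng = np.random.default_rng(seed)
    p = rng.random((m, 2))                  # first point of each pair
    q = rng.random((m, 2))                  # second point of each pair
    d = np.linalg.norm(p - q, axis=1)       # Euclidean distances
    estimate = d.mean()
    std_error = d.std(ddof=1) / np.sqrt(m)  # Monte Carlo standard error
    return estimate, std_error

est, se = mean_distance_unit_square(200_000)
print(est, se)
```

The standard error shrinks like 1/sqrt(m) regardless of the dimension of the region, which is why the approach is attractive in high dimensions.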
58 Example Non-Uniform Distribution: In an airport, what is the average distance between the end of a runway and where a plane lands? The landing points are not uniformly distributed. Let L be the length of the runway, and p(x) be the probability density that a plane lands at the point of distance x from the end of the runway. Compute $\int_0^L x\,p(x)\,dx$. Using a sample $\{x_1, x_2, \ldots, x_m\}$ of landing points following the distribution p(x), an estimate is $\frac{1}{m}\sum_{i=1}^{m} x_i$.
59 Estimation of Variance Assumption: the random sample {g(x i )} are independent and have 0 correlations J. Pei: Computational Statistics 59
60 Simulation Estimate Key: simulation of p p is called the target distribution using MC If p is a standard parametric distribution, software tools are available to simulate J. Pei: Computational Statistics 60
61 Inverse CDF: For any continuous distribution function f, if U ~ Unif(0, 1), then $X = f^{-1}(U) = \inf\{x : f(x) \ge U\}$ has a cumulative distribution function equal to f. If $f^{-1}$ is available, we can simulate the target distribution accordingly. Known as the inverse cumulative distribution function or probability integral transform approach.
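A small Python sketch of the inverse-CDF approach follows. The exponential target with rate `lam`, whose distribution function 1 - exp(-lam x) has the closed-form inverse -log(1 - u)/lam, is an assumption chosen for illustration; `sample_exponential` is a hypothetical name.

```python
import numpy as np

def sample_exponential(lam, n, seed=0):
    """Draw n Exp(lam) variates by inverting the CDF F(x) = 1 - exp(-lam * x)."""
    rng = np.random.default_rng(seed)
    u = rng.random(n)                 # U ~ Unif[0, 1)
    return -np.log(1.0 - u) / lam     # X = F^{-1}(U)

draws = sample_exponential(lam=2.0, n=100_000)
print(draws.mean())                   # should be close to 1 / lam
```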
62 Linear Interpolation: Used if $f^{-1}$ is not available but f is either available or easily approximated. Use a grid of $x_1, x_2, \ldots, x_m$ spanning the region of support of the target distribution. Calculate or approximate $u_i = f(x_i)$. Draw U ~ Unif(0, 1) and linearly interpolate between the two nearest grid points for which $u_i \le U \le u_j$ according to $X = \frac{u_j - U}{u_j - u_i}\,x_i + \frac{U - u_i}{u_j - u_i}\,x_j$.
63 Linear Interpolation: Pros and Cons. The degree of approximation is deterministic and can be reduced to any desired level by increasing m sufficiently. Requires a complete approximation to f regardless of the desired sample size. Does not generalize to multiple dimensions. Less efficient than other approaches.
64 Rejection Sampling: Assumptions: f(x) can be calculated or approximated; we know how to sample from a density g and can calculate g(x); we have an envelope e(·) such that $e(x) = g(x)/\alpha \ge f(x)$ for all x with f(x) > 0, where $\alpha \le 1$ is a constant. Algorithm: 1. Sample Y ~ g. 2. Sample U ~ Unif(0, 1). 3. Reject Y if U > f(Y)/e(Y) and return to step 1. 4. Otherwise, keep the value of Y as a unit of the random sample from f.
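The four steps can be sketched in Python as below. The target f(x) = 6x(1 - x) (a Beta(2, 2) density), the uniform proposal g, and the constant envelope e(x) = 1.5 (that is, alpha = 2/3) are assumptions for illustration; `rejection_sample` is a hypothetical name.

```python
import numpy as np

def rejection_sample(n, seed=0):
    """Draw n variates from f(x) = 6x(1-x) on [0,1] with a flat envelope."""
    rng = np.random.default_rng(seed)
    f = lambda x: 6.0 * x * (1.0 - x)   # target density, maximum 1.5 at x = 0.5
    e = 1.5                             # envelope e(x) = g(x) / alpha, g = Unif(0,1)
    kept, draws = [], 0
    while len(kept) < n:
        y = rng.random()                # step 1: Y ~ g
        u = rng.random()                # step 2: U ~ Unif(0, 1)
        draws += 1
        if u <= f(y) / e:               # steps 3-4: keep Y unless U > f(Y)/e(Y)
            kept.append(y)
    return np.array(kept), n / draws    # sample and observed acceptance rate

sample, acceptance = rejection_sample(20_000)
print(sample.mean(), acceptance)
```

The observed acceptance rate should be close to alpha, the ratio of the area under f to the area under e.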
65 Illustration J. Pei: Computational Statistics 65
66 Why Does Rejection Sampling Work? J. Pei: Computational Statistics 66
67 Efficiency of Rejection Sampling: Rejection sampling can be regarded as uniform sampling from the 2-d region under the curve e, followed by discarding the draws that fall between e and f. The larger the area between e and f, the lower the efficiency: many draws may have to be discarded.
68 Squeezed Rejection Sampling: For cases where evaluating f is costly. Use a non-negative squeezing function s such that $s(x) \le f(x)$ for any x with f(x) > 0. Algorithm: sample Y ~ g; sample U ~ Unif(0, 1); if $U \le s(Y)/e(Y)$, keep the value as a unit in the sample; otherwise, check whether $U \le f(Y)/e(Y)$ and, if so, keep the value as a unit of the sample.
69 Illustration Saving J. Pei: Computational Statistics 69
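70 Envelope & Squeezing Functions? Challenge: how to develop envelope and squeezing functions automatically? Assumptions: let $l(x) = \log f(x)$; f(x) > 0 on a (possibly infinite) interval; f is log-concave, that is, $l(a) - 2l(b) + l(c) < 0$ for any equally spaced a < b < c in the support region of f; f is continuous and differentiable; $l'(x)$ exists and decreases monotonically with respect to x (but may have discontinuities).
71 Adaptive Rejection Sampling: Evaluate $l$ and $l'$ at k points $x_1 < x_2 < \cdots < x_k$. The tangents at $x_i$ and $x_{i+1}$ intersect at $z_i = \frac{l(x_{i+1}) - l(x_i) - x_{i+1}\,l'(x_{i+1}) + x_i\,l'(x_i)}{l'(x_i) - l'(x_{i+1})}$ for i = 1, ..., k - 1. For $x \in [z_{i-1}, z_i]$, with $z_0$ and $z_k$ the (possibly infinite) endpoints of the support, $e(x) = \exp\{l(x_i) + (x - x_i)\,l'(x_i)\}$ defines an envelope. Similarly, for $x \in [x_i, x_{i+1}]$, $s(x) = \exp\left\{\frac{(x_{i+1} - x)\,l(x_i) + (x - x_i)\,l(x_{i+1})}{x_{i+1} - x_i}\right\}$ defines a squeezing function.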
72 Illustration J. Pei: Computational Statistics 72
73 Illustration J. Pei: Computational Statistics 73
74 An Alternative Not Using $l'$: Define $L_i(\cdot)$ to be the straight line function connecting $(x_i, l(x_i))$ and $(x_{i+1}, l(x_{i+1}))$ for i = 1, ..., k - 1. For $x \in [x_i, x_{i+1}]$, $e(x) = \exp\{\min(L_{i-1}(x), L_{i+1}(x))\}$ defines an envelope.
75 Illustration J. Pei: Computational Statistics 75
76 Approximate Simulation: Suppose we have an envelope g for target density f. Let $X = \{x_1, x_2, \ldots, x_m\}$ be a set of values drawn iid from g. Define the standardized importance weights as $w(x_i) = \frac{f(x_i)/g(x_i)}{\sum_{j=1}^{m} f(x_j)/g(x_j)}$.
77 Sampling Importance Resampling: Sample candidates $Y_1, Y_2, \ldots, Y_m$ iid from g. Calculate the standardized importance weights $w(Y_1), w(Y_2), \ldots, w(Y_m)$. Resample $X_1, X_2, \ldots, X_n$ from $Y_1, Y_2, \ldots, Y_m$ with replacement, with probabilities $w(Y_1), w(Y_2), \ldots, w(Y_m)$. A random variable X drawn with the sampling importance resampling algorithm has a distribution that converges to f as $m \to \infty$.
78 Comparison: Rejection sampling is perfect: the distribution of a generated draw is exactly f, but it requires a random number of draws to obtain a sample of size n. Sampling importance resampling uses a predefined number of draws to generate a sample of size n, but permits a random degree of approximation to f in the distribution of the sampled points.
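The resampling steps can be sketched in Python as follows. The unnormalized standard normal target, the N(0, 3^2) candidate density g, and the sizes m and n are assumptions for illustration; `sir` is a hypothetical name.

```python
import numpy as np

def sir(f_unnorm, g_sample, g_pdf, m, n, seed=0):
    """Sampling importance resampling: m candidates from g, n resampled draws."""
    rng = np.random.default_rng(seed)
    y = g_sample(rng, m)                  # candidates Y_1..Y_m iid from g
    w = f_unnorm(y) / g_pdf(y)            # raw importance weights f/g
    w /= w.sum()                          # standardized weights sum to 1
    return rng.choice(y, size=n, replace=True, p=w)

# Assumed target (up to a constant): standard normal; assumed g: N(0, 3^2).
f_unnorm = lambda x: np.exp(-0.5 * x ** 2)
g_sample = lambda rng, m: rng.normal(0.0, 3.0, m)
g_pdf = lambda x: np.exp(-x ** 2 / 18.0) / (3.0 * np.sqrt(2 * np.pi))

resampled = sir(f_unnorm, g_sample, g_pdf, m=100_000, n=5_000)
print(resampled.mean(), resampled.std())
```

Because the weights are standardized, f only needs to be known up to a normalizing constant.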
79 MARKOV CHAIN MONTE CARLO J. Pei: Computational Statistics 79
80 Joint Distribution of a Sequence: A sequence of random variables $\{X^{(t)}\}$, t = 0, 1, .... Each $X^{(t)}$ may equal one of a finite or countably infinite number of possible values, called states. The state space is the set of possible values of the random variable $X^{(t)}$. A complete probabilistic specification of $X^{(0)}, X^{(1)}, \ldots, X^{(n)}$, i.e., the joint distribution, is $P[X^{(0)}, \ldots, X^{(n)}] = P[X^{(n)} \mid x^{(0)}, \ldots, x^{(n-1)}] \times P[X^{(n-1)} \mid x^{(0)}, \ldots, x^{(n-2)}] \times \cdots \times P[X^{(1)} \mid x^{(0)}]\,P[X^{(0)}]$.
81 Markov Property: The conditional independence assumption: $P[X^{(t+1)} \mid x^{(0)}, \ldots, x^{(t)}] = P[X^{(t+1)} \mid x^{(t)}]$. Then, the joint distribution can be simplified to $P[X^{(0)}, \ldots, X^{(n)}] = P[X^{(n)} \mid x^{(n-1)}] \times \cdots \times P[X^{(1)} \mid x^{(0)}]\,P[X^{(0)}]$.
82 Markov Chains: Let $p_{ij}^{(t)}$ be the probability that the observed state changes from state i at time t to state j at time t + 1. The sequence $\{X^{(t)}\}$, t = 0, 1, ..., is a Markov chain if $p_{ij}^{(t)} = P[X^{(t+1)} = j \mid X^{(0)} = x^{(0)}, \ldots, X^{(t)} = i] = P[X^{(t+1)} = j \mid X^{(t)} = i]$. A Markov chain is time-homogeneous if $p_{ij}^{(t)} = p_{ij}$ for all t, and time-inhomogeneous otherwise.
83 Transition Probability Matrix: For a time-homogeneous Markov chain whose state space has s states, the s × s transition probability matrix $P = [p_{ij}]$ describes the one-step transition probabilities, with $0 \le p_{ij} \le 1$ and each row summing to 1.
84 Example J. Pei: Computational Statistics 84
85 Recurrent and Nonnull States: A state i is recurrent if the chain returns to the state with probability 1, i.e., $P\left(\sum_{t} 1\{X^{(t)} = i\} = \infty\right) = 1$. Otherwise, the state is transient, that is, there is a non-zero probability of leaving the state and never coming back. A state is nonnull (aka positive recurrent) if the expected time until recurrence is finite. If the state space is finite, every recurrent state is nonnull.
86 Ergodic Markov Chains: A Markov chain is irreducible if for any i and j, state j can be reached from state i in a finite number of steps, i.e., there exists m > 0 such that $P[X^{(m+n)} = j \mid X^{(n)} = i] > 0$. A state j has period d > 0 if the probability of going from state j to state j itself in n steps is 0 for all n not divisible by d. A Markov chain is aperiodic if every state has period 1; otherwise it is periodic. A Markov chain is ergodic if it is irreducible, aperiodic, and all states are nonnull and recurrent.
87 Stationary Distribution: Let π be a vector of probabilities such that $\sum_i \pi_i = 1$, where $\pi_i$ denotes the marginal probability that $X^{(t)} = i$. The marginal distribution of $X^{(t+1)}$ is $\pi^T P$. A distribution π such that $\pi^T P = \pi^T$ is called a stationary distribution for the Markov chain having transition probability matrix P. If $X^{(t)}$ follows a stationary distribution, then the marginal distributions of $X^{(t)}$ and $X^{(t+1)}$ are identical.
88 Reversible Markov Chains: A time-homogeneous Markov chain is reversible if for any i and j in the state space, $\pi_i\,p_{ij} = \pi_j\,p_{ji}$ (aka the detailed balance condition). In that case π is a stationary distribution for the chain. The joint distribution of a sequence of observations is the same whether the chain is run forwards or backwards.
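A stationary distribution of a small chain can be computed numerically, as in this Python sketch. The two-state matrix P is a made-up example, and `stationary_distribution` is a hypothetical helper that takes the left eigenvector of P for eigenvalue 1.

```python
import numpy as np

def stationary_distribution(P):
    """Return pi with pi^T P = pi^T and sum(pi) = 1 for a transition matrix P."""
    vals, vecs = np.linalg.eig(P.T)           # left eigenvectors of P
    i = np.argmin(np.abs(vals - 1.0))         # eigenvalue closest to 1
    pi = np.real(vecs[:, i])
    return pi / pi.sum()                      # normalize (also fixes the sign)

# Assumed example chain: rows are current states, columns are next states.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = stationary_distribution(P)
print(pi)
```

For this chain, solving pi^T P = pi^T by hand gives pi = (5/6, 1/6), which the numerical result should match.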
89 Uniqueness of Stationary Distribution: If a Markov chain with transition probability matrix P and stationary distribution π is irreducible and aperiodic, then π is unique and the $\pi_j$'s are the solutions to the following set of equations: $\pi_j \ge 0$, $\sum_i \pi_i = 1$, and $\pi_j = \sum_i \pi_i\,p_{ij}$ for each j.
90 Ergodic Theorem: If $X^{(1)}, X^{(2)}, \ldots$ are realizations from an irreducible and aperiodic Markov chain with stationary distribution π, then $X^{(n)}$ converges in distribution to the distribution given by π, and for any function h, $\frac{1}{n}\sum_{t=1}^{n} h(X^{(t)}) \to E_\pi\{h(X)\}$ almost surely as $n \to \infty$, provided $E_\pi\{|h(X)|\}$ exists.
91 Bayesian Inference: Suppose X has a distribution parameterized by θ. Prior distribution f(θ): the density assigned to θ before observing the data. Bayes theorem: $f(\theta|x) = c\,f(\theta)\,L(\theta|x)$. $f(\theta|x)$: the posterior density of θ, used for statistical inference about θ. $c = 1/\int f(\theta)\,L(\theta|x)\,d\theta$ is often difficult to compute directly.
92 Bayes Factor: Let $\tilde\theta$ be the posterior mode, and $\theta^*$ be the true value of θ. The posterior distribution of $\tilde\theta$ converges to $N(\theta^*, I(\theta^*)^{-1})$ as $n \to \infty$, under regularity conditions. The observed data should overwhelm any prior as $n \to \infty$. For two competing hypotheses $H_1$ and $H_2$, the Bayes factor is $B_{2,1} = \frac{P[x \mid H_2]}{P[x \mid H_1]}$.
93 Markov Chain Monte Carlo Idea: construct an irreducible, aperiodic Markov chain whose stationary distribution equals the target distribution f in Monte Carlo Challenge: the distribution of X (t) may differ substantially from f when t is too small and X (t) are serially dependent J. Pei: Computational Statistics 93
94 Metropolis-Hastings Algorithm: At t = 0, select $X^{(0)} = x^{(0)}$ drawn at random from some starting distribution g, with the requirement $f(x^{(0)}) > 0$. Given $X^{(t)} = x^{(t)}$, generate $X^{(t+1)}$ as follows: 1. Sample a candidate value $X^*$ from a proposal distribution $g(\cdot \mid x^{(t)})$. 2. Compute the Metropolis-Hastings ratio $R(x^{(t)}, X^*) = \frac{f(X^*)\,g(x^{(t)} \mid X^*)}{f(x^{(t)})\,g(X^* \mid x^{(t)})}$. 3. Sample a value for $X^{(t+1)}$ according to the following: $X^{(t+1)} = X^*$ with probability $\min\{R(x^{(t)}, X^*), 1\}$, and $X^{(t+1)} = x^{(t)}$ otherwise. 4. Increment t and return to step 1.
95 Why Does Metropolis-Hastings Work? Markov: $X^{(t+1)}$ is only dependent on $X^{(t)}$. A user has to check whether the chain is irreducible and aperiodic. If so, the chain generated has a unique limiting stationary distribution. Intuition: we can compute a function proportional to the target distribution, but not the exact density. As more and more sample values are produced, the distribution of values more closely approximates the desired distribution.
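The four steps can be sketched in Python on the log scale. The standard normal target (known only up to a constant), the normal random-walk proposal with scale 2, and the chain length are assumptions for illustration; `metropolis_hastings`, `propose`, and `log_g` are hypothetical names.

```python
import numpy as np

def metropolis_hastings(log_f, propose, log_g, x0, n, seed=0):
    """Generic MH sampler. log_g(a, b) is the log proposal density of a given b."""
    rng = np.random.default_rng(seed)
    x = x0
    chain = np.empty(n)
    for t in range(n):
        x_star = propose(rng, x)                       # step 1: candidate
        # step 2: log R = log[f(x*) g(x | x*)] - log[f(x) g(x* | x)]
        log_r = (log_f(x_star) + log_g(x, x_star)) - (log_f(x) + log_g(x_star, x))
        # step 3: accept with probability min(R, 1)
        if np.log(1.0 - rng.random()) < min(log_r, 0.0):
            x = x_star
        chain[t] = x                                   # step 4: next iteration
    return chain

scale = 2.0
log_f = lambda x: -0.5 * x ** 2                        # unnormalized N(0, 1)
propose = lambda rng, x: x + rng.normal(0.0, scale)    # random-walk proposal
log_g = lambda a, b: -0.5 * ((a - b) / scale) ** 2     # symmetric, constants dropped

chain = metropolis_hastings(log_f, propose, log_g, x0=0.0, n=20_000)
kept = chain[2_000:]                                   # discard a burn-in period
print(kept.mean(), kept.std())
```

With a symmetric proposal the two `log_g` terms cancel; they are kept here so the same function covers asymmetric proposals.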
96 Burn-in Period Sometimes the chain is dependent on the starting value persistently Burn-in period: omit some of the initial realizations of the chain when computing a sample average Challenge: how to design proposal distributions? J. Pei: Computational Statistics 96
97 Independence Chains: Set $g(x^* \mid x^{(t)}) = g(x^*)$ for some fixed density g. Each candidate value is drawn independently of the past. The Metropolis-Hastings ratio is $R(x^{(t)}, X^*) = \frac{f(X^*)\,g(x^{(t)})}{f(x^{(t)})\,g(X^*)}$. The resulting Markov chain is irreducible and aperiodic if g(x) > 0 whenever f(x) > 0.
98 Example Consider a set of observed data points sampled iid from the mixture distribution Find the value of δ J. Pei: Computational Statistics 98
99 Setting Prior and Proposal Distribution: Assume the prior distribution for δ is Unif(0, 1). Proposal distributions: Chain 1 uses Beta(1, 1), equivalent to Unif(0, 1). Chain 2 uses Beta(2, 10), skewed right with mean 2/12 ≈ 0.167, so values around 0.7 are unlikely to be generated.
100 Sample Paths J. Pei: Computational Statistics 100
101 Histograms of δ (t) J. Pei: Computational Statistics 101
102 Random Walk Chains: Draw ε ~ h(ε) for some density h. Common choices for h include a uniform distribution over a ball centered at the origin, a scaled standard normal distribution, and a scaled Student's t distribution. Set $X^* = x^{(t)} + \epsilon$, so that $g(x^* \mid x^{(t)}) = h(x^* - x^{(t)})$. If the support region of f is connected and h is positive in a neighborhood of 0, the resulting chain is irreducible and aperiodic.
103 Why Do Random Walk Chains Work? J. Pei: Computational Statistics 103
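104 Gibbs Sampling Idea: Sampling in a high-dimensional space is sometimes difficult. Gibbs sampling sequentially samples from univariate conditional distributions. Let $X = (X_1, \ldots, X_p)^T$ and $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)^T$. Suppose that the univariate conditional density of $X_i \mid X_{-i} = x_{-i}$, denoted by $f(x_i \mid x_{-i})$, is easily sampled for i = 1, ..., p.
105 Gibbs Sampling Algorithm: 1. Select starting values $x^{(0)}$, and set t = 0. 2. Generate, in turn, $X_1^{(t+1)} \sim f(x_1 \mid x_2^{(t)}, \ldots, x_p^{(t)})$, $X_2^{(t+1)} \sim f(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_p^{(t)})$, ..., $X_p^{(t+1)} \sim f(x_p \mid x_1^{(t+1)}, \ldots, x_{p-1}^{(t+1)})$. 3. Increment t and go to step 2.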
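The Gibbs cycle can be sketched in Python for a case where the conditionals are known in closed form. As an assumption, the target is a standard bivariate normal with correlation rho, for which X1 | X2 = x2 is N(rho x2, 1 - rho^2) and symmetrically for X2; `gibbs_bivariate_normal` is a hypothetical name.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                      # starting values x^(0)
    s = np.sqrt(1.0 - rho ** 2)            # conditional standard deviation
    out = np.empty((n, 2))
    for t in range(n):
        x1 = rng.normal(rho * x2, s)       # draw X1 | X2 = current x2
        x2 = rng.normal(rho * x1, s)       # draw X2 | X1 = updated x1
        out[t] = (x1, x2)
    return out

draws_g = gibbs_bivariate_normal(rho=0.8, n=20_000)[1_000:]   # drop burn-in
corr = np.corrcoef(draws_g[:, 0], draws_g[:, 1])[0, 1]
print(corr)
```

Each coordinate update uses the most recent value of the other coordinate, exactly as in step 2 of the algorithm.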
106 BOOTSTRAPPING J. Pei: Computational Statistics 106
107 Feature Estimation: Consider a cumulative distribution function F. We are interested in a feature of F expressed as a function θ = T(F) of F. Example: $T(F) = \int z\,dF(z)$ is the mean of F. Suppose we have random variables $X_1, \ldots, X_n$ ~ i.i.d. F (denoted as X ~ F), and $x_1, \ldots, x_n$ are realizations of $X_1, \ldots, X_n$. If $\hat{F}$ is the empirical distribution function of the observed data, then $\hat\theta = T(\hat{F})$ is an estimate of θ.
108 Unknown Distribution Function: When F is unknown, we are still interested in T(F) or, more generally, R(X, F), where X is the set of observed data. Example: $R(X, F) = [T(\hat{F}) - T(F)] / S(\hat{F})$, where $S(\hat{F})$ is the estimated standard deviation of $T(\hat{F})$. Bootstrap: approximate the distribution of R(X, F) using the observed data (an estimate of F).
109 Bootstrap Sample of Pseudo-data: $X^* = \{X_1^*, \ldots, X_n^*\}$, also called a pseudo-dataset. Each $X_i^*$ is an i.i.d. random variable with distribution $\hat{F}$.
110 A Simple Example: Let {x1, x2, x3} = {1, 2, 6} be an i.i.d. sample from a distribution F that has mean θ. We want to estimate θ by the sample mean $\hat\theta = 3$. Let $X^*$ consist of 3 elements drawn i.i.d. from $\hat{F}$; there are $3^3 = 27$ possible outcomes for $X^*$. Let $\hat{F}^*$ be the empirical distribution function of such a sample and $\hat\theta^* = T(\hat{F}^*)$. There are only 10 possible outcomes of $\hat\theta^*$.
111 Possible Outcomes of $\hat\theta^*$
112 The Bootstrap Principle: Approximate the distribution of R(X, F) using that of $R(X^*, \hat{F})$. Example: a 25/27 (≈ 93%) confidence interval for θ is (4/3, 14/3), using quantiles of the distribution of $\hat\theta^*$.
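For the three-point sample {1, 2, 6}, the bootstrap distribution of the mean is small enough to enumerate exactly. The Python sketch below does so with exact fractions; the variable names are hypothetical, and the interval endpoints 4/3 and 14/3 are treated as included.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

data = [1, 2, 6]
n = len(data)

# Enumerate all n^n equally likely pseudo-datasets and tabulate their means.
counts = Counter(Fraction(sum(s), n) for s in product(data, repeat=n))
total = n ** n
dist = {mean: Fraction(c, total) for mean, c in sorted(counts.items())}

# Probability that the bootstrap mean falls in [4/3, 14/3].
coverage = sum(p for mean, p in dist.items()
               if Fraction(4, 3) <= mean <= Fraction(14, 3))

for mean, p in dist.items():
    print(mean, p)
print("coverage:", coverage)
```

Only the two extreme pseudo-datasets, {1, 1, 1} and {6, 6, 6}, fall outside the interval, which is where the 25/27 comes from.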
113 Non-parametric Bootstrap: If the sample size is big, the number of potential bootstrap pseudo-datasets is very large, and it is impossible to enumerate all possible bootstrap pseudo-datasets. Draw B independent random bootstrap pseudo-datasets $X_i^*$ for i = 1, ..., B. Approximate the distribution of R(X, F) using the empirical distribution of $R(X_i^*, \hat{F})$ for i = 1, ..., B. The simulation error can be made arbitrarily small by increasing B.
114 A Simple Example: Draw with replacement from {1, 2, 6} with equal probability. Each bootstrap pseudo-dataset produces a corresponding estimate $\hat\theta^*$.
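115 Bootstrap Bias Correction: If we set $R(X, F) = \hat\theta - \theta$, its mean is the bias of $\hat\theta$. The bias is estimated by the bootstrap using $R(X^*, \hat{F}) = \hat\theta^* - \hat\theta$, i.e., by the mean $\bar\theta^* - \hat\theta$ over the bootstrap pseudo-datasets.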
116 Ensemble Classifiers C*(x)=Vote(C1(x),, Ck(x)) Figure from [Tan, Steinbach, Kumar] J. Pei: Computational Statistics 116
117 Why May Ensemble Methods Work? Suppose there are two classes and each base classifier has an error rate of 35%. What if we use 25 base classifiers? If all base classifiers are identical, the ensemble error rate is still 35%. If base classifiers are independent, the ensemble makes a wrong prediction only if more than half of the base classifiers are wrong: $\sum_{i=13}^{25} \binom{25}{i}\,0.35^i\,0.65^{25-i} \approx 0.06$.
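The binomial tail sum for the ensemble error rate is easy to evaluate directly. The Python sketch below assumes independent base classifiers with a common error rate and a simple majority vote; `ensemble_error` is a hypothetical name.

```python
from math import comb

def ensemble_error(n, eps):
    """P(more than half of n independent base classifiers err), each with rate eps."""
    k = n // 2 + 1                        # smallest losing majority (13 when n = 25)
    return sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i)
               for i in range(k, n + 1))

err = ensemble_error(25, 0.35)
print(round(err, 3))
```

Evaluating the same function at eps = 0.5 returns 0.5, which shows why each base classifier must beat random guessing for the ensemble to help.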
118 Ensemble Error Rate Figure from [Tan, Steinbach, Kumar] J. Pei: Computational Statistics 118
119 Ensemble Classifiers When? The base classifiers should be independent of each other Each base classifier should do better than a classifier that performs random guessing J. Pei: Computational Statistics 119
120 How to Construct Ensemble? Manipulating the training set: derive multiple training sets and build a base classifier on each Manipulating the input features: use only a subset of features in a base classifier Manipulating the class labels: if there are many classes, in a classifier, randomly divide the classes into two subsets A and B; for a test case, if a base classifier predicts its class as A, all classes in A receive a vote Manipulating the learning algorithm, e.g., using different network configuration in ANN J. Pei: Computational Statistics 120
121 Bootstrap: Given an original training set T, derive a training set T' by repeatedly uniformly sampling with replacement. If T has n tuples, each tuple has a probability $p = 1 - (1 - 1/n)^n$ of being selected in T'. When $n \to \infty$, $p \to 1 - 1/e \approx 0.632$. Use the tuples not in T' as the test set.
122 Bagging Run bootstrap k times to obtain k base classifiers A test instance is assigned to the class that receives the highest number of votes Strength: reduce the variance of base classifiers good for unstable base classifiers Unstable classifiers: sensitive to minor perturbations in the training set, e.g., decision trees, associative classifiers, and ANN For stable classifiers (e.g., linear discriminant analysis and knn classifiers), bagging may even degrade the performance since the training sets are smaller Less overfitting on noisy data J. Pei: Computational Statistics 122
123 Bootstrap in Testing: Use a bootstrap sample as the training set, and use the tuples not in the training set as the test set. .632 bootstrap: compute the overall accuracy by combining the accuracies $\epsilon_i$ of each bootstrap sample with the accuracy $acc_{all}$ computed from a classifier using the whole data set as the training set: $acc_{.632\,bootstrap} = \frac{1}{k}\sum_{i=1}^{k}\left(0.632\,\epsilon_i + 0.368\,acc_{all}\right)$.
124 NONPARAMETRIC DENSITY ESTIMATION J. Pei: Computational Statistics 124
125 Density Estimation Using observations of random variables X 1, X 2,, X n sampled independently from a density function f, estimate f Why density estimation? Assess multimodality, skew, tail behavior, Decision making, classification, summarize Bayesian posteriors A useful presentation tool to summarize a distribution A tool in some other methods, such as MCMC J. Pei: Computational Statistics 125
126 General Idea: Assume a parametric model $X_1, X_2, \ldots, X_n \sim$ iid $f_{X|\theta}$, where θ is a very low-dimensional parameter vector. Estimate the parameter by $\hat\theta$ using some estimation paradigm, such as maximum likelihood, Bayesian, or method-of-moments estimation. The resulting density estimate at x is $f_{X|\theta}(x \mid \hat\theta)$. If the assumed model $f_{X|\theta}$ is incorrect, the inferential error can be large.
127 Moments When a set of points representing a probability density 0 th moment: total of probability, i.e., 1 1 st moment: mean 2 nd moment: variance 3 rd moment: skewness Normalized moment: J. Pei: Computational Statistics 127
128 Method of Moments: Idea: use relations between population moments and parameters of interest. To estimate k unknown parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ of distribution $f_X(x \mid \theta)$, suppose the first k moments of the true distribution can be expressed as functions of θ: $\mu_i = E[X^i] = g_i(\theta_1, \ldots, \theta_k)$ for i = 1, ..., k. Draw a sample of n units $x_1, x_2, \ldots, x_n$; then the i-th sample moment is $\hat\mu_i = \frac{1}{n}\sum_{j=1}^{n} x_j^i$, an unbiased estimate of the population moment. Solve the equations $\hat\mu_i = g_i(\hat\theta_1, \ldots, \hat\theta_k)$ for i = 1, ..., k.
129 Nonparametric Density Estimation
- Assume very little about the form of f.
- Use mainly local information to estimate f at a point x.
130 Motivation
- If $f$ is smooth enough and we observe a point $X_i = x_i$, we assume $f$ assigns some density not only at $x_i$ but also in a region around $x_i$.
- To estimate $f$ from $X_1, X_2, \ldots, X_n \sim \text{iid } f$, we accumulate localized probability density contributions in regions around each $X_i$.
131 h-neighborhood
- To estimate the density at a point $x$, consider a region of width $dx = 2h$ centered at $x$; $h$ is a parameter to be set.
- The proportion of the observations that fall in the interval $\gamma = [x - h, x + h]$, divided by the width $2h$, approximates the density at $x$.
- Since $P(X \in \gamma) \approx 2h\, f(x)$ for small $h$, take
$$\hat f(x) = \frac{1}{2hn} \sum_{i=1}^{n} 1_{\{|x - X_i| \le h\}}$$
- Here, $1_{\{A\}} = 1$ if $A$ is true, otherwise 0.
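The h-neighborhood idea can be sketched in a few lines. The sketch assumes the estimator is the count of observations within distance $h$ of $x$, divided by $2hn$; the function name and data are illustrative.

```python
def naive_density(x, data, h):
    """Fraction of observations within h of x, divided by the interval width 2h."""
    count = sum(1 for xi in data if abs(x - xi) <= h)
    return count / (2 * h * len(data))

# Three of the four points lie within 0.15 of x = 0.1.
data = [0.0, 0.1, 0.2, 1.0]
estimate = naive_density(0.1, data, 0.15)
```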
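132 Estimation
- $N_\gamma$ is the number of sample points in the interval $\gamma$.
- $N_\gamma$ is a $\mathrm{Bin}(n, p(\gamma))$ random variable, where $p(\gamma) = \int_\gamma f(t)\,dt$ is the probability of falling in $\gamma$, so
  $E[N_\gamma / n] = p(\gamma)$ and $\mathrm{var}[N_\gamma / n] = p(\gamma)(1 - p(\gamma)) / n$.
- To obtain an accurate estimate, we need $h \to 0$ and $n \to \infty$, since $\mathrm{var}[N_\gamma / n] \to 0$ when $n \to \infty$.
- Thus, we should require $nh \to \infty$ and $h \to 0$ as $n \to \infty$.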
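133 Performance Measures
- Integrated squared error (ISE): $\mathrm{ISE}(h) = \int_{-\infty}^{\infty} \left( \hat f(x) - f(x) \right)^2 dx$
- Mean integrated squared error (MISE): $\mathrm{MISE}(h) = E\{\mathrm{ISE}(h)\}$
- $\mathrm{MISE}(h)$ is the accumulation of the local mean squared error at each $x$: $\mathrm{MISE}(h) = \int \mathrm{MSE}_h(\hat f(x))\,dx$, where $\mathrm{MSE}_h(\hat f(x)) = E\{(\hat f(x) - f(x))^2\}$.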
134 Kernel Functions
- Recall $\hat f(x) = \frac{1}{2hn} \sum_{i=1}^{n} 1_{\{|x - X_i| \le h\}}$: every point in the h-neighborhood is weighted the same.
- Intuition: closer points should be weighted more.
- Idea: use a kernel function.
- A kernel function $K(\cdot)$ is a non-negative, real-valued, integrable function such that $\int_{-\infty}^{\infty} K(u)\,du = 1$ and $K(-u) = K(u)$ for any $u$.
135 Kernel Density Estimation
- Using a kernel function $K$, estimate $f$ by
$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right)$$
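A minimal sketch of a kernel density estimator, assuming the standard form $\hat f(x) = \frac{1}{nh} \sum_i K((x - X_i)/h)$ and a standard normal kernel (the kernel choice and the names below are assumptions for illustration):

```python
import math

def gaussian_kernel(u):
    """Standard normal density: non-negative, symmetric, integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Kernel density estimate at x with fixed bandwidth h."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

# Density estimate at 0.5 from three observations with bandwidth 0.5.
value = kde(0.5, [-1.0, 0.0, 2.0], 0.5)
```

Because the kernel integrates to 1, so does the resulting estimate, which is an easy sanity check when experimenting with other kernels.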
136 Bandwidth
- The parameter $h$ is called the (fixed) bandwidth.
- The bandwidth has a strong influence on the estimator.
- If $h$ is too small, the estimator tends to assign probability density too locally near the observed data, resulting in a very wiggly estimated density function with many false modes.
- If $h$ is too large, the estimator spreads probability density contributions too diffusely; averaging over neighborhoods that are too large smooths away important features of $f$.
- The bandwidth controls the trade-off between the bias and the variance of the estimator.
137 Example
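138 Roughness
- Measure the roughness of a function $g$ by $R(g) = \int g^2(z)\,dz$.
- Assume $R(K) < \infty$, that $f'$ and $f''$ are bounded and continuous, and that $R(f'') < \infty$.
- Recall $\mathrm{MISE}(h) = \int \mathrm{MSE}_h(\hat f(x))\,dx$.
- Thus, $\mathrm{MISE}(h) = \int \mathrm{var}\{\hat f(x)\}\,dx + \int \left( \mathrm{bias}\{\hat f(x)\} \right)^2 dx$.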
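139 Computing the Bias
- The expectation of the estimator is $E\{\hat f(x)\} = \frac{1}{h} \int K\left( \frac{x - u}{h} \right) f(u)\,du = \int K(t)\, f(x - ht)\,dt$.
- Using the Taylor series expansion, $f(x - ht) = f(x) - ht f'(x) + \frac{h^2 t^2}{2} f''(x) + o(h^2)$.
- Since $K$ is symmetric about 0, $\int t K(t)\,dt = 0$, so $E\{\hat f(x)\} = f(x) + \frac{h^2 \sigma_K^2}{2} f''(x) + o(h^2)$, where $\sigma_K^2 = \int t^2 K(t)\,dt$.
- Thus, $\int \left( \mathrm{bias}\{\hat f(x)\} \right)^2 dx = \frac{h^4 \sigma_K^4 R(f'')}{4} + o(h^4)$.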
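140 Computing the Variance
- Using a similar strategy, $\mathrm{var}\{\hat f(x)\} = \frac{1}{nh} f(x) R(K) + o\left( \frac{1}{nh} \right)$, so $\int \mathrm{var}\{\hat f(x)\}\,dx = \frac{R(K)}{nh} + o\left( \frac{1}{nh} \right)$.
- Thus, $\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\left( \frac{1}{nh} + h^4 \right)$, where
$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4 \sigma_K^4 R(f'')}{4}$$
  is the asymptotic mean integrated squared error and $\sigma_K^2 = \int t^2 K(t)\,dt$.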
141 Optimal Bandwidth
- Set $h$ at an intermediate value that avoids excessive bias and excessive variability.
- Theoretically, $h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$ is the optimal bandwidth, since it minimizes $\mathrm{AMISE}(h)$.
- Not practically useful, since it depends on $f$, which is to be estimated.
- Turn to some heuristic methods.
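The optimal bandwidth follows from a one-line minimization. The derivation below is a sketch that assumes the usual asymptotic expression $\mathrm{AMISE}(h) = R(K)/(nh) + h^4 \sigma_K^4 R(f'')/4$, with $\sigma_K^2 = \int t^2 K(t)\,dt$:

```latex
\begin{align*}
\frac{d}{dh}\,\mathrm{AMISE}(h)
  &= -\frac{R(K)}{n h^{2}} + h^{3} \sigma_K^{4} R(f'') = 0
  \quad\Longrightarrow\quad
  h^{5} = \frac{R(K)}{n \sigma_K^{4} R(f'')}, \\
h &= \left( \frac{R(K)}{n \sigma_K^{4} R(f'')} \right)^{1/5}, \\
\mathrm{AMISE}(h)\Big|_{\text{optimal } h}
  &= \frac{5}{4} \left[ \sigma_K R(K) \right]^{4/5} R(f'')^{1/5}\, n^{-4/5}.
\end{align*}
```

The optimal bandwidth therefore shrinks at rate $n^{-1/5}$, and the best achievable AMISE decays at rate $n^{-4/5}$.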
142 Cross-Validation Ideas
- Relate $h$ to some quantified quality measure $Q(h)$ of $\hat f$ as an estimator of $f$.
- Optimize $Q(h)$ to find the optimal $h$.
- Problem: we use the observed data to calculate $\hat f$ and then use the same data to evaluate the quality of $\hat f$, which may lead to overfitting.
- Remedy: use some of the data to learn a model and the rest of the data to evaluate it.
143 Cross-Validation
- To evaluate the quality of $\hat f$ at the $i$-th data point, use all the data except for the $i$-th data point to train the model.
- Denote the estimated density at $X_i$ using a kernel density estimator with all the observations except for $X_i$ by
$$\hat f_{-i}(X_i) = \frac{1}{h(n-1)} \sum_{j \ne i} K\left( \frac{X_i - X_j}{h} \right)$$
- Set $Q(h)$ to be a function of $\hat f_{-1}(X_1), \ldots, \hat f_{-n}(X_n)$.
- The bandwidth estimated this way can be highly sensitive to sampling variability.
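144 Using Pseudo-Likelihood
- Set $Q(h)$ to the pseudo-likelihood (PL): $\mathrm{PL}(h) = \prod_{i=1}^{n} \hat f_{-i}(X_i)$, where $\hat f_{-i}(X_i)$ is the leave-one-out kernel estimate at $X_i$.
- Maximize $\mathrm{PL}(h)$ to find the optimal $h$.
- Simple and intuitively appealing.
- The resulting estimate is often too wiggly and too sensitive to outliers.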
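The pseudo-likelihood approach can be sketched as a grid search. The sketch assumes a normal kernel, the leave-one-out estimate $\hat f_{-i}(X_i) = \frac{1}{h(n-1)} \sum_{j \ne i} K((X_i - X_j)/h)$, and a hand-picked list of candidate bandwidths; all names are illustrative. Working with the log of the pseudo-likelihood avoids numerical underflow in the product.

```python
import math

def gaussian_kernel(u):
    """Standard normal density used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def loo_density(i, data, h):
    """Kernel estimate at data[i] computed from all observations except data[i]."""
    n = len(data)
    total = sum(gaussian_kernel((data[i] - xj) / h)
                for j, xj in enumerate(data) if j != i)
    return total / (h * (n - 1))

def log_pseudo_likelihood(data, h):
    """log PL(h): the sum of log leave-one-out densities."""
    return sum(math.log(loo_density(i, data, h)) for i in range(len(data)))

def best_bandwidth(data, candidates):
    """Pick the candidate bandwidth with the largest pseudo-likelihood."""
    return max(candidates, key=lambda h: log_pseudo_likelihood(data, h))

data = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
h_pl = best_bandwidth(data, [0.05, 0.5, 5.0])
```

In practice a finer grid or a one-dimensional optimizer would replace the three-point candidate list.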
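145 Unbiased Cross-Validation Criterion
- Rewrite the integrated squared error as
$$\mathrm{ISE}(h) = \int \hat f^2(x)\,dx - 2 \int \hat f(x) f(x)\,dx + \int f^2(x)\,dx = R(\hat f) - 2E\{\hat f(X)\} + R(f)$$
- The last term is constant in $h$.
- The second term, $-2E\{\hat f(X)\}$, can be estimated by $-\frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i)$, where $\hat f_{-i}(X_i)$ is the leave-one-out kernel estimate at $X_i$.
- Thus, minimizing
$$\mathrm{UCV}(h) = R(\hat f) - \frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i)$$
  with respect to $h$ can find a good $h$.
- It is called the unbiased cross-validation criterion because $E\{\mathrm{UCV}(h) + R(f)\} = \mathrm{MISE}(h)$.
- It is also known as least squares cross-validation, because choosing $h$ to minimize $\mathrm{UCV}(h)$ minimizes the integrated squared error between $\hat f$ and $f$ with respect to $h$.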
146 Using a Normal Kernel
- If analytic evaluation of $R(\hat f)$ is infeasible, use a different kernel that permits an analytic simplification.
- If a normal kernel $\phi$ is used, $\mathrm{UCV}(h)$ can be written in closed form as a double sum over the pairwise differences $X_i - X_j$.
- Slow convergence to the optimum.
- Strong dependence on the observed data: when applied to different data sets drawn from the same distribution, it may yield very different answers.
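One way to see the analytic simplification for a normal kernel: the integral of a product of two normal densities with standard deviation $h$ centered at $X_i$ and $X_j$ equals a normal density with standard deviation $\sqrt{2}h$ evaluated at $X_i - X_j$. The sketch below uses that identity to evaluate $\mathrm{UCV}(h) = R(\hat f) - \frac{2}{n} \sum_i \hat f_{-i}(X_i)$ exactly. It is derived from this definition and is not necessarily the same algebraic form shown on the original slide; the function names are illustrative.

```python
import math

def normal_pdf(d, s):
    """Density of N(0, s^2) evaluated at d."""
    return math.exp(-0.5 * (d / s) ** 2) / (s * math.sqrt(2 * math.pi))

def ucv_normal(data, h):
    """UCV(h) for a normal kernel, using the closed form of R(f_hat)."""
    n = len(data)
    # R(f_hat) = (1/n^2) * sum_i sum_j N(X_i - X_j; 0, 2 h^2)
    roughness = sum(normal_pdf(xi - xj, math.sqrt(2) * h)
                    for xi in data for xj in data) / n ** 2
    # sum_i f_hat_{-i}(X_i) = (1/(n-1)) * sum_{i != j} N(X_i - X_j; 0, h^2)
    loo_sum = sum(normal_pdf(xi - xj, h)
                  for i, xi in enumerate(data)
                  for j, xj in enumerate(data) if i != j) / (n - 1)
    return roughness - 2 * loo_sum / n

data = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
h_ucv = min([0.1, 0.5, 1.0, 2.0], key=lambda h: ucv_normal(data, h))
```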
147 Plug-in Methods
- Apply a pilot bandwidth to estimate one or more important features of $f$.
- The bandwidth for estimating $f$ itself is then estimated at a second stage, using a criterion that depends on the estimated features.
- Theoretically, $h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$ is the optimal bandwidth.
- Key: estimate $R(f'')$.
- Very effective in many applications.
148 Silverman's Rule of Thumb
- Replace $f$ by a normal density with variance set to match the sample variance.
- Equivalently, estimate $R(f'')$ by $R(\phi'') / \hat\sigma^5$, where $\phi$ is the standard normal density function and $\hat\sigma$ is the sample standard deviation.
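As a worked instance of this rule, assume a normal kernel, so that $R(K) = 1/(2\sqrt{\pi})$ and $\sigma_K = 1$, and use $R(\phi'') = 3/(8\sqrt{\pi})$. Substituting $R(\phi'')/\hat\sigma^5$ for $R(f'')$ in $h = (R(K) / (n \sigma_K^4 R(f'')))^{1/5}$ gives $h = (4/(3n))^{1/5} \hat\sigma$. A minimal sketch (the function name is illustrative):

```python
import statistics

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth for a normal kernel: (4 / (3n))^(1/5) * sigma_hat."""
    n = len(data)
    sigma_hat = statistics.stdev(data)  # sample standard deviation
    return (4 / (3 * n)) ** 0.2 * sigma_hat

h_rot = silverman_bandwidth([1.0, 2.0, 3.0, 4.0, 5.0])
```

For other kernels the constant changes, because $R(K)$ and $\sigma_K$ differ.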
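149 Empirical Estimation of R(f'')
- Empirical estimation of $R(f'')$ in $h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$.
- The kernel-based estimator is
$$\hat f''(x) = \frac{d^2}{dx^2} \left\{ \frac{1}{n h_0} \sum_{i=1}^{n} L\left( \frac{x - X_i}{h_0} \right) \right\} = \frac{1}{n h_0^3} \sum_{i=1}^{n} L''\left( \frac{x - X_i}{h_0} \right)$$
- $h_0$ is the bandwidth and $L$ is a sufficiently differentiable kernel used to estimate $f''$.
- The best bandwidth for estimating $f$ differs from the best bandwidth for estimating $f''$ or $R(f'')$.
- A larger bandwidth is required for estimating $f''$, that is, $h_0 > h$.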
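150 The Sheather-Jones Method
- A two-stage process:
  - A simple rule of thumb is used to calculate the bandwidth $h_0$.
  - Use $h_0$ to estimate $R(f'')$.
  - Compute $h$ using $h = \left( \frac{R(K)}{n \sigma_K^4 \hat R(f'')} \right)^{1/5}$, where $\hat R(f'')$ is the estimate from the previous step.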
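151 An Implementation
- For univariate kernel density estimation with pilot kernel $L = \phi$, the Sheather-Jones bandwidth is the value of $h$ that solves the equation
$$\left( \frac{R(K)}{n \sigma_K^4 \hat R_{\hat\alpha(h)}(f'')} \right)^{1/5} - h = 0$$
152 An Implementation (Cont'd)
- Here,
$$\hat R_{\alpha}(f'') = \frac{1}{n(n-1)\alpha^5} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi^{(4)}\left( \frac{X_i - X_j}{\alpha} \right), \qquad \hat\alpha(h) = \left( \frac{6\sqrt{2}\, h^5\, \hat R_a(f'')}{\hat R_b(f''')} \right)^{1/7},$$
$$\hat R_b(f''') = -\frac{1}{n(n-1) b^7} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi^{(6)}\left( \frac{X_i - X_j}{b} \right), \qquad a = \frac{0.920\, \mathrm{IQR}}{n^{1/7}}, \qquad b = \frac{0.912\, \mathrm{IQR}}{n^{1/9}}$$
- The equation can be solved numerically using, for example, Newton's method.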
153 Interquartile Range (IQR)
- A measure of statistical dispersion, also known as the midspread or middle fifty.
- The difference between the upper and lower quartiles: $\mathrm{IQR} = Q_3 - Q_1$.
- Example: IQR = 10.
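A short sketch of computing the IQR with Python's standard library. Quartile conventions differ between textbooks and software; the `"inclusive"` method used here is one common choice and is an assumption, not something the slides specify.

```python
import statistics

def iqr(data):
    """Interquartile range Q3 - Q1 using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

spread = iqr([1, 2, 3, 4, 5, 6, 7, 8, 9])
```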
154 Maximal Smoothing Principle
- The bandwidth should be selected to discourage false modes, producing an estimate that shows modes only where the data indisputably require them.
- Replace $R(f'')$ with the most conservative (i.e., smallest) possible value.
- Consider all $h$ that would minimize $\mathrm{AMISE}(h)$ as $f$ varies, and select the largest such $h$.
- Equivalently, the right-hand side of $h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$ should be maximized with respect to $f$.
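155 Implementation
- Set $h = 3 \left( \frac{R(K)}{35\, n\, \sigma_K^4} \right)^{1/5} \hat\sigma$, where $\hat\sigma$ is the sample standard deviation; for a kernel with $\sigma_K = 1$ this is $h = 3 \left( \frac{R(K)}{35 n} \right)^{1/5} \hat\sigma$.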
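A hedged sketch of the maximal smoothing bandwidth, assuming the oversmoothing rule $h = 3 (R(K) / (35 n \sigma_K^4))^{1/5} \hat\sigma$ and, by default, a normal kernel with $R(K) = 1/(2\sqrt{\pi})$ and $\sigma_K = 1$. The function and parameter names are illustrative.

```python
import math
import statistics

def maximal_smoothing_bandwidth(data, roughness_k=1 / (2 * math.sqrt(math.pi)), sigma_k=1.0):
    """Oversmoothed bandwidth: 3 * (R(K) / (35 n sigma_K^4))^(1/5) * sigma_hat."""
    n = len(data)
    sigma_hat = statistics.stdev(data)  # sample standard deviation
    return 3 * (roughness_k / (35 * n * sigma_k ** 4)) ** 0.2 * sigma_hat

sample = [1.0, 2.0, 3.0, 4.0, 5.0]
h_ms = maximal_smoothing_bandwidth(sample)
```

For a normal kernel this works out to roughly $1.144\, \hat\sigma\, n^{-1/5}$, slightly larger than the rule-of-thumb constant $(4/3)^{1/5} \approx 1.059$, which is consistent with choosing the largest reasonable bandwidth.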
Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Topics to be Covered Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling (CDF) Ancestral Sampling Rejection
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More information