Computational Statistics. Jian Pei School of Computing Science Simon Fraser University

1 Computational Statistics Jian Pei School of Computing Science Simon Fraser University

2 BASIC OPTIMIZATION METHODS J. Pei: Computational Statistics 2

3 Why Optimization? In statistical inference, we often have to carry out maximum likelihood estimation. Most functions cannot be optimized analytically. Example: to maximize a given function $g$, set $g'(x) = 0$; the resulting equation does not have an analytic solution.

4 Problem Formulation. Let $l$ be the log-likelihood function. If $\hat{\theta}$ is the maximum likelihood estimate (MLE), then it is a solution of the score equation $l'(\theta) = 0$, where $l'(\theta)$ is the vector of partial derivatives of $l$ with respect to the components of $\theta$.

5 Example

6 The Bisection Method. Idea: if $g'$ is continuous on $[a_0, b_0]$ and $g'(a_0)\,g'(b_0) \le 0$, then there exists at least one $x^* \in [a_0, b_0]$ such that $g'(x^*) = 0$. Systematically shrink the interval from $[a_0, b_0]$ to $[a_1, b_1]$ to $[a_2, b_2]$ and so on, always keeping a root inside.

7 The Bisection Method. Let $x^{(0)} = (a_0 + b_0)/2$ be the starting value; then $[a_{t+1}, b_{t+1}] = [a_t, x^{(t)}]$ if $g'(a_t)\,g'(x^{(t)}) \le 0$, and $[a_{t+1}, b_{t+1}] = [x^{(t)}, b_t]$ otherwise, where $x^{(t+1)} = (a_{t+1} + b_{t+1})/2$.

8 Example

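9 Stopping Rule. Stop if the procedure appears to have achieved satisfactory convergence, or if it appears unlikely to do so soon. The absolute convergence criterion: $|x^{(t+1)} - x^{(t)}| < \epsilon$. For bisection, $b_t - a_t = 2^{-t}(b_0 - a_0)$, so a true error tolerance of $\delta$ is achieved when $2^{-(t+1)}(b_0 - a_0) < \delta$, that is, when $t > \log_2\{(b_0 - a_0)/\delta\} - 1$ holds.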
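
The interval-halving update and an absolute stopping rule can be sketched as follows. This is a minimal illustration, not the course's own code: the function name `bisection`, the tolerance, and the objective $g(x) = \log x / (1 + x)$ are assumptions chosen only for demonstration.

```python
import math

def bisection(g_prime, a, b, tol=1e-8, max_iter=200):
    """Find a root of g_prime in [a, b] by repeatedly halving the interval."""
    if g_prime(a) * g_prime(b) > 0:
        raise ValueError("g'(a) and g'(b) must not have the same sign")
    for _ in range(max_iter):
        x = (a + b) / 2.0
        # Keep the half-interval whose endpoints still bracket a sign change.
        if g_prime(a) * g_prime(x) <= 0:
            b = x
        else:
            a = x
        # The midpoint of [a, b] is within (b - a) / 2 of a root.
        if (b - a) / 2.0 < tol:
            break
    return (a + b) / 2.0

# Hypothetical objective g(x) = log(x) / (1 + x); this is its derivative.
def g_prime(x):
    return (1.0 + 1.0 / x - math.log(x)) / (1.0 + x) ** 2

x_star = bisection(g_prime, 1.0, 5.0)
```

Because the interval length halves at every step, the number of iterations needed for a given tolerance is known in advance, which is the point of the bound on $t$ above.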

10 Relative Convergence Criterion. Criterion: $|x^{(t+1)} - x^{(t)}| / |x^{(t)}| < \epsilon$. It may not work well when $x^*$ is close to 0. Alternatively, use $|x^{(t+1)} - x^{(t)}| / (|x^{(t)}| + \epsilon) < \epsilon$. The bisection method is, in theory, guaranteed to converge to a root.

11 Stopping in Failure to Converge. The MLE may not be unique. Convergence may not happen because of limited computational precision. Stopping rules that flag a failure to converge: stop after $N$ iterations; stop if some convergence measure fails to decrease, or cycles, over several iterations.

12 Bracketing Methods. Bound a root within a sequence of nested intervals of decreasing length. Bisection is a bracketing method. Slow! If $g'$ is continuous on $[a_0, b_0]$, a root can be found. Bracketing methods do not rely on the existence, behavior, or ease of deriving of $g''$.

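13 Newton-Raphson Iteration. Assumption: $g'$ is continuously differentiable and $g''(x^*) \ne 0$. At iteration $t$, approximate $g'(x^*)$ by the linear Taylor series expansion $0 = g'(x^*) \approx g'(x^{(t)}) + (x^* - x^{(t)})\,g''(x^{(t)})$. That is, $x^* \approx x^{(t)} - g'(x^{(t)}) / g''(x^{(t)})$. The updating equation: $x^{(t+1)} = x^{(t)} + h^{(t)}$, where $h^{(t)} = -g'(x^{(t)}) / g''(x^{(t)})$.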
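
A minimal sketch of the Newton update, assuming both derivatives are available as Python callables. The function name `newton` and the objective $g(x) = \log x - x^2/2$ (maximized at $x = 1$) are illustrative assumptions, not taken from the slides.

```python
def newton(g_prime, g_double_prime, x0, tol=1e-10, max_iter=100):
    """Iterate x <- x + h with h = -g'(x) / g''(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        h = -g_prime(x) / g_double_prime(x)  # Newton increment h^(t)
        x += h
        if abs(h) < tol:  # absolute convergence criterion
            return x
    raise RuntimeError("Newton's method did not converge")

# Hypothetical objective g(x) = log(x) - x**2 / 2.
x_hat = newton(lambda x: 1.0 / x - x,          # g'(x)
               lambda x: -1.0 / x ** 2 - 1.0,  # g''(x)
               x0=0.5)
```

The iteration cap doubles as a failure-to-converge flag: unlike bisection, Newton's method carries no guarantee when started far from the root.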

14 Example

15 Newton's Method May Not Converge

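16 An Interesting Situation. Suppose $g'''$ is continuous and $g''(x^*) \ne 0$. There exists a neighborhood of $x^*$ within which $g''(x) \ne 0$ for all $x$. Define $\epsilon^{(t)} = x^{(t)} - x^*$. Using a Taylor expansion, $0 = g'(x^*) = g'(x^{(t)}) + (x^* - x^{(t)})\,g''(x^{(t)}) + \frac{1}{2}(x^* - x^{(t)})^2 g'''(q)$ for some $q$ between $x^{(t)}$ and $x^*$. That is, $x^{(t)} + h^{(t)} - x^* = (x^* - x^{(t)})^2\, g'''(q) / (2 g''(x^{(t)}))$, where $h^{(t)} = -g'(x^{(t)})/g''(x^{(t)})$ is the Newton increment. That is, $\epsilon^{(t+1)} = (\epsilon^{(t)})^2\, g'''(q) / (2 g''(x^{(t)}))$.

17 An Interesting Situation (cont'd). Consider a neighborhood of $x^*$, $N_\delta(x^*) = [x^* - \delta, x^* + \delta]$ for $\delta > 0$. Define $c(\delta) = \max_{x_1, x_2 \in N_\delta(x^*)} |g'''(x_1) / (2 g''(x_2))|$. Then $c(\delta) \to |g'''(x^*) / (2 g''(x^*))|$ as $\delta \to 0$, i.e., $\delta c(\delta) \to 0$. Thus, we can choose a $\delta$ such that $\delta c(\delta) < 1$.

18 An Interesting Situation (cont'd). If $x^{(t)} \in N_\delta(x^*)$, then from $\epsilon^{(t+1)} = (\epsilon^{(t)})^2\, g'''(q) / (2 g''(x^{(t)}))$ we have $|c(\delta)\,\epsilon^{(t+1)}| \le (c(\delta)\,\epsilon^{(t)})^2$. Assuming a starting point satisfying $|\epsilon^{(0)}| = |x^{(0)} - x^*| \le \delta$, it follows that $|\epsilon^{(t)}| \le (c(\delta)\,\delta)^{2^t} / c(\delta) \to 0$ as $t \to \infty$.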

19 Newton's Method Converges. If $g'''$ is continuous and $x^*$ is a simple root of $g'$, then there exists a neighborhood of $x^*$ for which Newton's method converges to $x^*$ when started from any $x^{(0)}$ in that neighborhood. A root $x^*$ of $g'$ is simple if it has multiplicity one, that is, $g'(x^*) = 0$ and $g''(x^*) \ne 0$. Other conditions under which Newton's method converges also exist.

20 Convergence Order. Measures how fast a root-finding method is. With $\epsilon^{(t)} = x^{(t)} - x^*$, a method has convergence of order $\beta > 0$ if it converges, $\lim_{t \to \infty} \epsilon^{(t)} = 0$, and $\lim_{t \to \infty} |\epsilon^{(t+1)}| / |\epsilon^{(t)}|^{\beta} = c$. Convergence speed: $c \ne 0$ is a constant.

21 Convergence Order of Newton's Method. Recall $\epsilon^{(t+1)} = (\epsilon^{(t)})^2\, g'''(q) / (2 g''(x^{(t)}))$. Then $\epsilon^{(t+1)} / (\epsilon^{(t)})^2 = g'''(q) / (2 g''(x^{(t)}))$. If Newton's method converges, then this ratio tends to $g'''(x^*) / (2 g''(x^*))$, so Newton's method has quadratic convergence, that is, $\beta = 2$.

22 Fisher Scoring. Replace $-l''(\theta^{(t)})$ in the Newton update with $I(\theta^{(t)})$. $I(\theta)$ is the Fisher information, which can be approximated by $-l''(\theta)$; $I(\theta^{(t)})$ is the expected Fisher information evaluated at $\theta^{(t)}$. The updating equation is $\theta^{(t+1)} = \theta^{(t)} + l'(\theta^{(t)})\, I(\theta^{(t)})^{-1}$.

23 Comparison. Fisher scoring and Newton's method have the same asymptotic properties. Fisher scoring generally works better in the beginning, making rapid improvements. Newton's method generally works better for refinement near the end.

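24 Secant Method. Newton's method becomes inconvenient if $g''(x^{(t)})$ is hard to calculate. Use the discrete-difference approximation $g''(x^{(t)}) \approx [g'(x^{(t)}) - g'(x^{(t-1)})] / (x^{(t)} - x^{(t-1)})$. The updating equation becomes, for $t > 0$, $x^{(t+1)} = x^{(t)} - g'(x^{(t)})\,(x^{(t)} - x^{(t-1)}) / [g'(x^{(t)}) - g'(x^{(t-1)})]$.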
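
A minimal sketch of the secant update, which needs two starting values but only first derivatives. The function name `secant` and the test objective $g(x) = \log x - x^2/2$ are assumptions for illustration.

```python
def secant(g_prime, x0, x1, tol=1e-10, max_iter=100):
    """Root of g_prime using a finite-difference stand-in for g''."""
    for _ in range(max_iter):
        f0, f1 = g_prime(x0), g_prime(x1)
        if f1 == f0:
            raise ZeroDivisionError("flat secant: g' equal at both iterates")
        # x2 = x1 - g'(x1) * (x1 - x0) / (g'(x1) - g'(x0))
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    raise RuntimeError("secant method did not converge")

# Hypothetical objective g(x) = log(x) - x**2 / 2, so g'(x) = 1/x - x.
root = secant(lambda x: 1.0 / x - x, 0.5, 0.8)
```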

25 Convergence Order. The secant method has convergence of order $\beta = (1 + \sqrt{5})/2 \approx 1.62$. The secant method therefore has a slower order of convergence than Newton's method.

26 Example

27 THE EM METHOD

28 A Frequentist Setting. Conceive of observed data generated from random variables $X$, along with some missing or unobserved data from random variables $Z$. $Z$ is often called latent. Envision complete data generated from $Y = (X, Z)$. Given observed data $x$, we wish to maximize a likelihood $L(\theta|x)$, which is often hard to work with. It is easier to deal with the densities of $Y|\theta$ and $Z|(x, \theta)$. EM works with these easier densities instead of working with $L(\theta|x)$ directly.

29 Latent Z. Conceptually, $Z$ may be viewed as having been removed from the complete $Y$ through the application of some many-to-fewer mapping $X = M(Y)$. Let $f_X(x|\theta)$ be the density of the observed data and $f_Y(y|\theta)$ the density of the complete data. A marginalization model: we observe $X$ having density $f_X(x|\theta) = \int_{\{y : M(y) = x\}} f_Y(y|\theta)\,dy$. The conditional density of the missing data is $f_{Z|X}(z|x, \theta) = f_Y(y|\theta) / f_X(x|\theta)$.

30 A Bayesian Setting. Estimate the mode of a posterior distribution $f(\theta|x)$. Consider unobserved random variables $\psi$ in addition to the parameters of interest $\theta$. Missing data may not be really missing: it is a conceptual tool to simplify problems.

31 Marginalization. $L(\theta|x)$ is a marginalization of the complete-data likelihood $L(\theta|y) = L(\theta|x, z)$. Alternatively, assume missing parameters $\psi$ of no interest. There is no difference under the Bayesian paradigm.

32 The EM Algorithm: Objectives. Iteratively seek to maximize $L(\theta|x)$ with respect to $\theta$. Let $\theta^{(t)}$ be the estimated maximizer at iteration $t$, for $t = 0, 1, \ldots$ Let $Q(\theta|\theta^{(t)})$ be the expectation of the joint log likelihood for the complete data, conditional on the observed data $X = x$.

33 The EM Algorithm: Steps. The E step: compute $Q(\theta|\theta^{(t)})$. The M step: maximize $Q(\theta|\theta^{(t)})$ with respect to $\theta$ and set $\theta^{(t+1)}$ to the maximizer of $Q$. Return to the E step until a stopping criterion is met.

34 Fuzzy Clustering. Each point $x_i$ has a weight $w_{ij}$, interpreted as the probability that it belongs to cluster $C_j$. Requirements: for each point $x_i$, $\sum_{j=1}^{k} w_{ij} = 1$; for each cluster $C_j$, $0 < \sum_{i=1}^{m} w_{ij} < m$.

35 Fuzzy C-Means (FCM). Select an initial fuzzy pseudo-partition, i.e., assign values to all the $w_{ij}$. Repeat: compute the centroid of each cluster using the fuzzy pseudo-partition; recompute the fuzzy pseudo-partition, i.e., the $w_{ij}$. Until the centroids do not change (or the change is below some threshold).

36 Critical Details. Optimization on the sum of the squared error (SSE): $SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^{p}\, dist(x_i, c_j)^2$. Computing centroids: $c_j = \sum_{i=1}^{m} w_{ij}^{p} x_i \,/\, \sum_{i=1}^{m} w_{ij}^{p}$. Updating the fuzzy pseudo-partition: $w_{ij} = (1/dist(x_i, c_j)^2)^{\frac{1}{p-1}} \,/\, \sum_{q=1}^{k} (1/dist(x_i, c_q)^2)^{\frac{1}{p-1}}$. When $p = 2$: $w_{ij} = (1/dist(x_i, c_j)^2) \,/\, \sum_{q=1}^{k} 1/dist(x_i, c_q)^2$.
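
The centroid and weight updates can be sketched as below. This is an illustrative sketch only: the function name `fuzzy_c_means`, the random initialization, the fixed iteration count (instead of a centroid-change threshold), and the small floor guarding against a zero distance are all assumptions.

```python
import math
import random

def fuzzy_c_means(points, k, p=2.0, n_iter=100, seed=0):
    """Alternate centroid and weight updates; points are equal-length tuples."""
    rng = random.Random(seed)
    m, dim = len(points), len(points[0])
    # Random initial fuzzy pseudo-partition: each row of w sums to 1.
    w = []
    for _ in range(m):
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        w.append([v / total for v in row])
    centroids = []
    for _ in range(n_iter):
        # Centroids: c_j = sum_i w_ij^p x_i / sum_i w_ij^p
        centroids = []
        for j in range(k):
            denom = sum(w[i][j] ** p for i in range(m))
            centroids.append(tuple(
                sum(w[i][j] ** p * points[i][d] for i in range(m)) / denom
                for d in range(dim)))
        # Weights: w_ij proportional to (1 / dist(x_i, c_j)^2)^(1 / (p - 1))
        for i in range(m):
            inv = [(1.0 / max(math.dist(points[i], c) ** 2, 1e-12))
                   ** (1.0 / (p - 1.0)) for c in centroids]
            total = sum(inv)
            w[i] = [v / total for v in inv]
    return centroids, w
```

For example, `fuzzy_c_means([(0.0,), (0.2,), (10.0,), (10.2,)], k=2)` returns two centroids and a weight matrix whose rows each sum to 1.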

37 Choice of p. When $p \to 1$, FCM behaves like traditional k-means. When $p$ is larger, the cluster centroids approach the global centroid of all data points. The partition becomes fuzzier as $p$ increases.

38 Effectiveness

39 Mixture Models. A cluster can be modeled as a probability distribution. Practically, assume that each distribution can be approximated well by a multivariate normal distribution. Multiple clusters are a mixture of different probability distributions. A data set is a set of observations from a mixture of models.

40 Object Probability. Suppose there are $k$ clusters and a set $X$ of $m$ objects. Let the $j$-th cluster have parameter $\theta_j = (\mu_j, \sigma_j)$. The probability that a point is in the $j$-th cluster is $w_j$, with $w_1 + \cdots + w_k = 1$. The probability of an object $x$ is $prob(x|\Theta) = \sum_{j=1}^{k} w_j\, p_j(x|\theta_j)$, and $prob(X|\Theta) = \prod_{i=1}^{m} prob(x_i|\Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j\, p_j(x_i|\theta_j)$.

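41 Example. $prob(x_i|\Theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$, $\theta_1 = (-4, 2)$, $\theta_2 = (4, 2)$, $prob(x|\Theta) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}$.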

42 Maximum Likelihood Estimation. Maximum likelihood principle: if we know that a set of objects comes from one distribution but do not know the parameter, we can choose the parameter that maximizes the probability of the data. Maximize $prob(X|\Theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$. Equivalently, maximize $\log prob(X|\Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\,m \log 2\pi - m \log \sigma$.

43 EM Algorithm. Expectation-Maximization algorithm: select an initial set of model parameters. Repeat. Expectation step: for each object $x_i$, calculate the probability that it belongs to each distribution $j$, i.e., $prob(\text{distribution } j \mid x_i, \Theta)$. Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood. Until the parameters are stable.
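
A minimal sketch of these two steps for a one-dimensional mixture of normal distributions. The function names, the fixed number of iterations (instead of a stability check), and the closed-form M-step updates for weights, means, and standard deviations are assumptions of this sketch.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_gaussian_mixture(xs, mus, sigmas, weights, n_iter=100):
    """EM for a 1-D normal mixture; returns updated (mus, sigmas, weights)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E step: probability that each object belongs to each distribution.
        resp = []
        for x in xs:
            joint = [weights[j] * normal_pdf(x, mus[j], sigmas[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M step: weighted estimates that maximize the expected log likelihood.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-12))  # floor avoids a zero sigma
    return mus, sigmas, weights
```

With data simulated from two normal components centered at -4 and 4, starting values such as `mus=[-1.0, 1.0]` typically move toward the true centers within a few iterations.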

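44 Convergence. The log of the observed-data density can be rewritten as $\log f_X(x|\theta) = \log f_Y(y|\theta) - \log f_{Z|X}(z|x, \theta)$. Thus, $E\{\log f_X(x|\theta) \mid x, \theta^{(t)}\} = E\{\log f_Y(y|\theta) \mid x, \theta^{(t)}\} - E\{\log f_{Z|X}(z|x, \theta) \mid x, \theta^{(t)}\}$, where the expectations are taken with respect to the distribution of $Z|(x, \theta^{(t)})$. Define $H(\theta|\theta^{(t)}) = E\{\log f_{Z|X}(Z|x, \theta) \mid x, \theta^{(t)}\}$. Then, $\log f_X(x|\theta) = Q(\theta|\theta^{(t)}) - H(\theta|\theta^{(t)})$.

45 $H(\theta|\theta^{(t)})$. Maximized by $\theta = \theta^{(t)}$: for any $\theta \ne \theta^{(t)}$, $H(\theta|\theta^{(t)}) \le H(\theta^{(t)}|\theta^{(t)})$. If $\theta^{(t+1)}$ is chosen to maximize $Q(\theta|\theta^{(t)})$ with respect to $\theta$, then $\log f_X(x|\theta^{(t+1)}) - \log f_X(x|\theta^{(t)}) \ge 0$, since $Q$ increases and $H$ decreases, with strict inequality when $Q(\theta^{(t+1)}|\theta^{(t)}) > Q(\theta^{(t)}|\theta^{(t)})$.

46 Generalized EM Algorithm. Standard EM: choose $\theta^{(t+1)}$ at each iteration to maximize $Q(\theta|\theta^{(t)})$ with respect to $\theta$. Generalized EM (GEM): choose any $\theta^{(t+1)}$ such that $Q(\theta^{(t+1)}|\theta^{(t)}) > Q(\theta^{(t)}|\theta^{(t)})$. Both EM and GEM converge under suitable regularity conditions.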

47 Order of Convergence. The EM algorithm defines a mapping $\theta^{(t+1)} = \Psi(\theta^{(t)})$. When EM converges, it converges to a fixed point $\hat{\theta}$ of the mapping. The global rate of EM convergence is $\rho = \lim_{t \to \infty} \|\theta^{(t+1)} - \hat{\theta}\| / \|\theta^{(t)} - \hat{\theta}\|$. Let $\Psi'(\theta)$ be the Jacobian matrix whose $(i, j)$-th element is $d\Psi_i(\theta)/d\theta_j$. Then $\rho$ equals the largest eigenvalue of $\Psi'(\hat{\theta})$ when $-l''(\hat{\theta}|x)$ is positive definite.

48 Rationale. Jacobian matrix: $J = [\partial F_i / \partial x_j]$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, where $F: R^n \to R^m$. If $F$ is differentiable at a point $x$, then the Jacobian matrix defines a linear map $R^n \to R^m$, which is the best linear approximation of $F$ near the point $x$.

49 Implications. Linear convergence. EM suffers slower convergence when the proportion of missing information is larger. Ease of implementation.

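50 Hessian Matrix. For a function $f: R^n \to R$ such that all second partial derivatives of $f$ exist and are continuous over the domain of the function, the Hessian matrix (or simply Hessian) is $\nabla^2 f(x) = [\partial^2 f / \partial x_i \partial x_j]$, $i, j = 1, \ldots, n$. Using Taylor expansion, $f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + \frac{1}{2} \Delta x^T\, \nabla^2 f(x)\, \Delta x$.

51 Missing-Information Principle. Recall $\log f_X(x|\theta) = Q(\theta|\theta^{(t)}) - H(\theta|\theta^{(t)})$. Take second partial derivatives and negate both sides: $-l''(\theta|x) = -Q''(\theta|\omega)|_{\omega=\theta} + H''(\theta|\omega)|_{\omega=\theta}$. Rewrite to $\hat{i}_X(\theta) = \hat{i}_Y(\theta) - \hat{i}_{Z|X}(\theta)$, the missing-information principle. $\hat{i}_X(\theta) = -l''(\theta|x)$: the observed information. $\hat{i}_Y(\theta)$: the complete information. $\hat{i}_{Z|X}(\theta)$: the missing information.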

52 Estimating Covariance The covariance matrix for is Where the variance is taken with respect to f Z X J. Pei: Computational Statistics 52

53 MONTE CARLO METHODS

54 Estimation of a Definite Integral. Motivating example: estimate $\int_D f(x)\,dx$. Situations where Monte Carlo is not needed: the integral can be evaluated in closed form; $D$ is of low dimensionality (e.g., 1-2). Decomposition: if $f(x)$ can be rewritten as $f(x) = g(x)\,p(x)$ such that $p(x)$ can be treated as a probability density function, that is, $p(x) \ge 0$ and $\int_D p(x)\,dx = 1$, then $\int_D f(x)\,dx = E[g(X)]$ for $X \sim p$.

55 Estimation. How to estimate $E[g(X)] = \int_D g(x)\,p(x)\,dx$? Draw a random sample $\{x_1, x_2, \ldots, x_m\}$ from the distribution with probability density $p$ and use the sample mean $\frac{1}{m} \sum_{i=1}^{m} g(x_i)$.

56 Monte Carlo Method (PDF decomposition). Decompose the function of interest to include a probability density function as a factor. Identify an expected value. Use a sample to estimate the expected value. The PDF decomposition may not be unique. Typically used when the dimensionality is high.

57 Example: Uniform Distribution. In the $[0, 1] \times [0, 1]$ square, what is the average distance between two points? In general, we can answer this question for any finite region using any distance measure. Compute the expected distance between two points drawn independently and uniformly from the square. Draw a random sample of point pairs $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$; an estimate is $\frac{1}{m} \sum_{i=1}^{m} dist(x_i, y_i)$.
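
A minimal sketch of this estimate, assuming Euclidean distance and treating each sampled pair as two independent uniform points in the unit square. The function name and the sample size are illustrative choices.

```python
import math
import random

def mean_distance_unit_square(m, seed=0):
    """Monte Carlo estimate of the mean Euclidean distance between two
    independent uniform points in the unit square."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        p = (rng.random(), rng.random())
        q = (rng.random(), rng.random())
        total += math.dist(p, q)
    return total / m

estimate = mean_distance_unit_square(50000)
```

The exact answer is known to be about 0.5214, so the sketch also serves as a sanity check on the sample-mean estimator.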

58 Example: Non-Uniform Distribution. In an airport, what is the average distance between the end of a runway and where a plane lands? The landing points are not uniformly distributed. Let $L$ be the length of the runway, and $p(x)$ be the probability density of a plane landing at distance $x$ from the end of the runway. Compute $\int_0^L x\,p(x)\,dx$. Using a sample $\{x_1, x_2, \ldots, x_m\}$ of landing points following the distribution $p(x)$, an estimate is $\frac{1}{m} \sum_{i=1}^{m} x_i$.

59 Estimation of Variance. Assumption: the sampled values $g(x_i)$ are independent, and hence uncorrelated.

60 Simulation. Estimate the expected value using Monte Carlo (MC). Key: simulation from $p$; $p$ is called the target distribution. If $p$ is a standard parametric distribution, software tools are available to simulate from it.

61 Inverse CDF. For any continuous distribution function $f$, if $U \sim Unif(0, 1)$, then $X = f^{-1}(U) = \inf\{x : f(x) \ge U\}$ has a cumulative distribution function equal to $f$. If $f^{-1}$ is available, we can simulate the target distribution accordingly. Known as the inverse cumulative distribution function or probability integral transform approach.
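
As an illustration, the exponential distribution has the closed-form inverse $-\log(1 - u)/\lambda$ of its distribution function $1 - e^{-\lambda x}$. The choice of distribution and the function name `sample_exponential` are assumptions made for this sketch.

```python
import math
import random

def sample_exponential(rate, n, seed=0):
    """Inverse-CDF sampling: push Unif(0, 1) draws through the inverse
    of the exponential distribution function."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so 1 - u is never 0 and the log is finite.
    return [-math.log(1.0 - rng.random()) / rate for _ in range(n)]

draws = sample_exponential(rate=2.0, n=50000)
```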

62 Linear Interpolation. If $f^{-1}$ is not available but $f$ is either available or easily approximated: use a grid of $x_1, x_2, \ldots, x_m$ spanning the region of support of the target distribution; calculate or approximate $u_i = f(x_i)$; draw $U \sim Unif(0, 1)$ and linearly interpolate between the two nearest grid points for which $u_i \le U \le u_j$ according to $X = \frac{u_j - U}{u_j - u_i}\, x_i + \frac{U - u_i}{u_j - u_i}\, x_j$.

63 Linear Interpolation: Pros and Cons. The degree of approximation is deterministic, and the approximation error can be reduced to any desired level by increasing $m$ sufficiently. Requires a complete approximation to $f$ regardless of the desired sample size. Does not generalize to multiple dimensions. Less efficient than other approaches.

64 Rejection Sampling. Assumptions: $f(x)$ can be calculated or approximated; we know how to sample from a density $g$ and can calculate $g(x)$; we have an envelope $e(\cdot)$ such that $e(x) = g(x)/\alpha \ge f(x)$ for all $x$ with $f(x) > 0$, where $\alpha \le 1$ is a constant. Algorithm: 1. Sample $Y \sim g$. 2. Sample $U \sim Unif(0, 1)$. 3. Reject $Y$ if $U > f(Y)/e(Y)$ and return to step 1. 4. Otherwise, keep the value of $Y$ as a unit of the random sample from $f$.
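
The four steps can be sketched as follows. The target $f(x) = 6x(1 - x)$ on $[0, 1]$ (a Beta(2, 2) density), the uniform $g$, and $\alpha = 2/3$ are assumptions chosen so that the envelope $e(x) = 1.5$ just covers the maximum of $f$; the function names are hypothetical.

```python
import random

def rejection_sample(f, sample_g, g_density, alpha, n, seed=0):
    """Draw n values from f using the envelope e(x) = g(x) / alpha >= f(x)."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        y = sample_g(rng)              # step 1: Y ~ g
        u = rng.random()               # step 2: U ~ Unif(0, 1)
        envelope = g_density(y) / alpha
        if u <= f(y) / envelope:       # steps 3-4: keep Y unless U > f(Y)/e(Y)
            out.append(y)
    return out

sample = rejection_sample(
    f=lambda x: 6.0 * x * (1.0 - x),   # assumed target: Beta(2, 2) density
    sample_g=lambda rng: rng.random(), # g = Unif(0, 1)
    g_density=lambda x: 1.0,
    alpha=2.0 / 3.0,
    n=20000,
)
```

With these choices the acceptance probability equals $\alpha = 2/3$, so roughly one draw in three is discarded.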

65 Illustration

66 Why Does Rejection Sampling Work?

67 Efficiency of Rejection Sampling. Rejection sampling can be regarded as uniform sampling from the 2-d region under the curve $e$, followed by discarding the draws that fall between $e$ and $f$. The larger the area between $e$ and $f$, the lower the efficiency: many draws may have to be discarded.

68 Squeezed Rejection Sampling. For cases where evaluating $f$ is costly. Use a non-negative squeezing function $s$ such that $s(x) \le f(x)$ for any $x$ with $f(x) > 0$. Algorithm: sample $Y \sim g$; sample $U \sim Unif(0, 1)$; if $U \le s(Y)/e(Y)$, keep the value as a unit in the sample; otherwise, check whether $U \le f(Y)/e(Y)$, and if so, keep the value as a unit of the sample; otherwise reject it.

69 Illustration: Saving

70 Envelope & Squeezing Functions? Challenge: how to develop envelope and squeezing functions automatically? Assumptions: let $l(x) = \log f(x)$; $f(x) > 0$ on a (possibly infinite) interval; $f$ is log-concave, that is, for any three equally spaced points $a < b < c$ in the support region of $f$, $l(a) - 2l(b) + l(c) < 0$; $f$ is continuous and differentiable; $l'(x)$ exists and decreases monotonically with respect to $x$ (but may have discontinuities).

71 Adaptive Rejection Sampling. Evaluate $l$ and $l'$ at $k$ points $x_1 < x_2 < \cdots < x_k$. The tangents at $x_i$ and $x_{i+1}$ intersect at $z_i = \frac{l(x_{i+1}) - l(x_i) - x_{i+1}\, l'(x_{i+1}) + x_i\, l'(x_i)}{l'(x_i) - l'(x_{i+1})}$ for $i = 1, \ldots, k - 1$. For $x \in [z_{i-1}, z_i]$, $e^*(x) = l(x_i) + (x - x_i)\, l'(x_i)$ defines an envelope $e(x) = \exp\{e^*(x)\}$. Similarly, for $x \in [x_i, x_{i+1}]$, $s^*(x) = \frac{(x_{i+1} - x)\, l(x_i) + (x - x_i)\, l(x_{i+1})}{x_{i+1} - x_i}$ defines a squeezing function $s(x) = \exp\{s^*(x)\}$.

72 Illustration

73 Illustration

74 An Alternative Not Using $l'$. Define $L_i(\cdot)$ to be the straight line function connecting $(x_i, l(x_i))$ and $(x_{i+1}, l(x_{i+1}))$ for $i = 1, \ldots, k - 1$. Then $e^*(x) = \min\{L_{i-1}(x), L_{i+1}(x)\}$ for $x \in [x_i, x_{i+1}]$ defines an envelope.

75 Illustration

76 Approximate Simulation. Suppose we have an envelope $g$ for target density $f$. Let $X = \{x_1, x_2, \ldots, x_m\}$ be a set of values drawn i.i.d. from $g$. Define the standardized importance weights as $w(x_i) = \frac{f(x_i)/g(x_i)}{\sum_{j=1}^{m} f(x_j)/g(x_j)}$.

77 Sampling Importance Resampling. Sample candidates $Y_1, Y_2, \ldots, Y_m$ i.i.d. from $g$. Calculate the standardized importance weights $w(Y_1), w(Y_2), \ldots, w(Y_m)$. Resample $X_1, X_2, \ldots, X_n$ from $Y_1, Y_2, \ldots, Y_m$ with replacement, with probabilities $w(Y_1), w(Y_2), \ldots, w(Y_m)$. A random variable $X$ drawn with the sampling importance resampling algorithm has a distribution that converges to $f$ as $m \to \infty$.
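
A minimal sketch of sampling importance resampling. The function name `sir`, the standard normal target (given only up to a constant, which the standardized weights cancel), and the $N(0, 2^2)$ sampling density are assumptions for illustration.

```python
import math
import random

def sir(f, sample_g, g_density, m, n, seed=0):
    """Draw m candidates from g, then resample n of them with probabilities
    given by the standardized importance weights."""
    rng = random.Random(seed)
    ys = [sample_g(rng) for _ in range(m)]
    raw = [f(y) / g_density(y) for y in ys]
    total = sum(raw)
    weights = [r / total for r in raw]       # standardized: they sum to 1
    return rng.choices(ys, weights=weights, k=n)

resampled = sir(
    f=lambda x: math.exp(-0.5 * x * x),      # assumed target, unnormalized N(0, 1)
    sample_g=lambda rng: rng.gauss(0.0, 2.0),
    g_density=lambda x: math.exp(-x * x / 8.0) / (2.0 * math.sqrt(2.0 * math.pi)),
    m=20000,
    n=5000,
)
```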

78 Comparison. Rejection sampling is perfect: the distribution of a generated draw is exactly $f$, but it requires a random number of draws to obtain a sample of size $n$. Sampling importance resampling uses a predefined number of draws to generate a sample of size $n$, but permits a random degree of approximation to $f$ in the distribution of the sampled points.

79 MARKOV CHAIN MONTE CARLO

80 Joint Distribution of a Sequence. A sequence of random variables $\{X^{(t)}\}$, $t = 0, 1, \ldots$ Each $X^{(t)}$ may equal one of a finite or countably infinite number of possible values, called states. The state space is the set of possible values of the random variable $X^{(t)}$. A complete probabilistic specification of $X^{(0)}, X^{(1)}, \ldots, X^{(n)}$, i.e., the joint distribution, is $P[X^{(0)}, \ldots, X^{(n)}] = P[X^{(n)} \mid x^{(0)}, \ldots, x^{(n-1)}] \times P[X^{(n-1)} \mid x^{(0)}, \ldots, x^{(n-2)}] \times \cdots \times P[X^{(1)} \mid x^{(0)}]\, P[X^{(0)}]$.

81 Markov Property. The conditional independence assumption: $P[X^{(t)} \mid x^{(0)}, \ldots, x^{(t-1)}] = P[X^{(t)} \mid x^{(t-1)}]$. Then the joint distribution can be simplified to $P[X^{(0)}, \ldots, X^{(n)}] = P[X^{(n)} \mid x^{(n-1)}] \times \cdots \times P[X^{(1)} \mid x^{(0)}]\, P[X^{(0)}]$.

82 Markov Chains. Let $p_{ij}^{(t)}$ be the probability that the observed state changes from state $i$ at time $t$ to state $j$ at time $t + 1$. The sequence $\{X^{(t)}\}$, $t = 0, 1, \ldots$, is a Markov chain if $p_{ij}^{(t)} = P[X^{(t+1)} = j \mid X^{(0)} = x^{(0)}, \ldots, X^{(t)} = i] = P[X^{(t+1)} = j \mid X^{(t)} = i]$. A Markov chain is time-homogeneous if $p_{ij}^{(t)} = p_{ij}$ for all $t$, and time-inhomogeneous otherwise.

83 Transition Probability Matrix. For a time-homogeneous Markov chain whose state space has $s$ states, the $s \times s$ transition probability matrix $P = [p_{ij}]$ describes the one-step transition probabilities, with $0 \le p_{ij} \le 1$ and each row summing to 1.

84 Example

85 Recurrent and Nonnull States. A state $i$ is recurrent if the chain returns to the state with probability 1, so that a chain started in $i$ visits it infinitely often: $P(\sum_{t} 1_{\{X^{(t)} = i\}} = \infty) = 1$. Otherwise, the state is transient, that is, there is a non-zero probability that the chain leaves the state and never comes back. A state is nonnull (also known as positive recurrent) if the expected time until recurrence is finite. If the state space is finite, every recurrent state is nonnull.

86 Ergodic Markov Chains. A Markov chain is irreducible if, for any $i$ and $j$, state $j$ can be reached from state $i$ in a finite number of steps: there exists $m > 0$ such that $P[X^{(m+n)} = j \mid X^{(n)} = i] > 0$. A state $j$ has period $d > 0$ if the probability of going from state $j$ to state $j$ itself in $n$ steps is 0 for all $n$ not divisible by $d$, where $d$ is the largest such integer. A Markov chain is aperiodic if every state has period 1; otherwise it is periodic. A Markov chain is ergodic if it is irreducible, aperiodic, and all states are nonnull and recurrent.

87 Stationary Distribution. Let $\pi$ be a vector of probabilities such that $\sum_i \pi_i = 1$, where $\pi_i$ denotes the marginal probability that $X^{(t)} = i$. The marginal distribution of $X^{(t+1)}$ is then $\pi^T P$. A distribution $\pi$ such that $\pi^T P = \pi^T$ is called a stationary distribution for the Markov chain with transition probability matrix $P$. If $X^{(t)}$ follows a stationary distribution, then the marginal distributions of $X^{(t)}$ and $X^{(t+1)}$ are identical.

88 Reversible Markov Chains. A time-homogeneous Markov chain is reversible if, for any $i$ and $j$ in the state space, $\pi_i p_{ij} = \pi_j p_{ji}$ (also known as the detailed balance condition). In that case $\pi$ is a stationary distribution for the chain. The joint distribution of a sequence of observations is the same whether the chain is run forwards or backwards.

89 Uniqueness of Stationary Distribution. If a Markov chain with transition probability matrix $P$ and stationary distribution $\pi$ is irreducible and aperiodic, then $\pi$ is unique and the $\pi_j$ are the solutions to the following set of equations: $\pi_j \ge 0$, $\sum_i \pi_i = 1$, and $\pi_j = \sum_i \pi_i p_{ij}$ for each $j$.
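
For a small irreducible, aperiodic chain, the stationary distribution can be approximated by applying $\pi^T \leftarrow \pi^T P$ repeatedly. The function name and the two-state transition matrix below are hypothetical; this is a sketch, not a general-purpose solver.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10000):
    """Iterate pi^T <- pi^T P from the uniform distribution until it stops changing."""
    s = len(P)
    pi = [1.0 / s] * s
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(s)) for j in range(s)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("no convergence; the chain may be periodic or reducible")

# Hypothetical two-state chain; solving pi^T P = pi^T by hand gives (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary_distribution(P)
```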

90 Ergodic Theorem. If $X^{(1)}, X^{(2)}, \ldots$ are realizations from an irreducible and aperiodic Markov chain with stationary distribution $\pi$, then $X^{(n)}$ converges in distribution to the distribution given by $\pi$, and for any function $h$, $\frac{1}{n} \sum_{t=1}^{n} h(X^{(t)}) \to E_\pi\{h(X)\}$ almost surely as $n \to \infty$, provided $E_\pi\{|h(X)|\}$ exists.

91 Bayesian Inference. Suppose $X$ has a distribution parameterized by $\theta$. Prior distribution $f(\theta)$: the density assigned to $\theta$ before observing the data. Bayes' theorem: $f(\theta|x) = c\, f(\theta)\, L(\theta|x)$. $f(\theta|x)$: the posterior density of $\theta$, used for statistical inference about $\theta$. $c = 1 / \int f(\theta)\, L(\theta|x)\, d\theta$, often difficult to compute directly.

92 Bayes Factor. Let $\tilde{\theta}$ be the posterior mode, and $\theta^*$ be the true value of $\theta$. The posterior distribution of $\tilde{\theta}$ converges to $N(\theta^*, I(\theta^*)^{-1})$ as $n \to \infty$, under regularity conditions. The observed data should overwhelm any prior as $n \to \infty$. For two competing hypotheses $H_1$ and $H_2$, the Bayes factor is the ratio of the marginal likelihoods of the data under the two hypotheses.

93 Markov Chain Monte Carlo. Idea: construct an irreducible, aperiodic Markov chain whose stationary distribution equals the target distribution $f$ of the Monte Carlo method. Challenge: the distribution of $X^{(t)}$ may differ substantially from $f$ when $t$ is too small, and the $X^{(t)}$ are serially dependent.

94 Metropolis-Hastings Algorithm. At $t = 0$, select $X^{(0)} = x^{(0)}$ drawn at random from some starting distribution $g$, with the requirement $f(x^{(0)}) > 0$. Given $X^{(t)} = x^{(t)}$, generate $X^{(t+1)}$ as follows: 1. Sample a candidate value $X^*$ from a proposal distribution $g(\cdot \mid x^{(t)})$. 2. Compute the Metropolis-Hastings ratio $R(x^{(t)}, X^*)$, where $R(u, v) = \frac{f(v)\, g(u|v)}{f(u)\, g(v|u)}$. 3. Sample a value for $X^{(t+1)}$ according to the following: $X^{(t+1)} = X^*$ with probability $\min\{R(x^{(t)}, X^*), 1\}$, and $X^{(t+1)} = x^{(t)}$ otherwise. 4. Increment $t$ and return to step 1.
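
The four steps can be sketched as below. The function names, the unnormalized standard normal target, and the normal random-walk proposal are assumptions for illustration; both `target` and `proposal_density` are specified only up to constants, which cancel in the ratio.

```python
import math
import random

def metropolis_hastings(f, propose, g_density, x0, n_steps, seed=0):
    """g_density(a, b) is the proposal density of a given current state b."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    for _ in range(n_steps):
        cand = propose(x, rng)                       # step 1
        # step 2: R = f(x*) g(x(t) | x*) / (f(x(t)) g(x* | x(t)))
        ratio = (f(cand) * g_density(x, cand)) / (f(x) * g_density(cand, x))
        if rng.random() < min(ratio, 1.0):           # step 3
            x = cand
        chain.append(x)                              # step 4: next iteration
    return chain

def target(x):
    return math.exp(-0.5 * x * x)         # assumed target, unnormalized N(0, 1)

def propose(x, rng):
    return x + rng.gauss(0.0, 1.0)        # assumed normal random-walk proposal

def proposal_density(a, b):
    return math.exp(-0.5 * (a - b) ** 2)  # symmetric, up to a constant

chain = metropolis_hastings(target, propose, proposal_density, 0.0, 20000)
```

Because this proposal is symmetric, the two proposal terms cancel and the ratio reduces to $f(X^*)/f(x^{(t)})$; they are kept in the code so the sketch also covers asymmetric proposals.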

95 Why Does Metropolis-Hastings Work? Markov: $X^{(t+1)}$ depends only on $X^{(t)}$. A user has to check whether the chain is irreducible and aperiodic. If so, the chain generated has a unique limiting stationary distribution. Intuition: we can compute a function proportional to the target density, but not the density itself. As more and more sample values are produced, the distribution of values more closely approximates the desired distribution.

96 Burn-in Period. Sometimes the chain depends persistently on the starting value. Burn-in period: omit some of the initial realizations of the chain when computing a sample average. Challenge: how to design proposal distributions?

97 Independence Chains. Set $g(x^* \mid x^{(t)}) = g(x^*)$ for some fixed density $g$. Each candidate value is drawn independently of the past. The Metropolis-Hastings ratio is $R(x^{(t)}, X^*) = \frac{f(X^*)\, g(x^{(t)})}{f(x^{(t)})\, g(X^*)}$. The resulting Markov chain is irreducible and aperiodic if $g(x) > 0$ whenever $f(x) > 0$.

98 Example. Consider a set of observed data points sampled i.i.d. from a mixture distribution with mixing parameter $\delta$. Find the value of $\delta$.

99 Setting Prior and Proposal Distribution. Assume the prior distribution for $\delta$ is $Unif(0, 1)$. Proposal distributions: chain 1 uses $Beta(1, 1)$, equivalent to $Unif(0, 1)$; chain 2 uses $Beta(2, 10)$, skewed right with mean $2/12 \approx 0.167$, so values around 0.7 are unlikely to be generated.

100 Sample Paths

101 Histograms of $\delta^{(t)}$

102 Random Walk Chains. Draw $\epsilon \sim h(\epsilon)$ for some density $h$. Common choices for $h$ include a uniform distribution over a ball centered at the origin, a scaled standard normal distribution, and a scaled Student's t distribution. Set $X^* = x^{(t)} + \epsilon$, so that $g(x^* \mid x^{(t)}) = h(x^* - x^{(t)})$. If the support region of $f$ is connected and $h$ is positive in a neighborhood of 0, the resulting chain is irreducible and aperiodic.

103 Why Do Random Walk Chains Work?

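104 Gibbs Sampling. Idea: sampling in a high-dimensional space is sometimes difficult; Gibbs sampling sequentially samples from univariate conditional distributions. Let $X = (X_1, \ldots, X_p)^T$ and $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)^T$. Suppose that the univariate conditional density of $X_i \mid X_{-i} = x_{-i}$, denoted by $f(x_i \mid x_{-i})$, is easily sampled for $i = 1, \ldots, p$.

105 Gibbs Sampling Algorithm. 1. Select starting values $x^{(0)}$, and set $t = 0$. 2. Generate, in turn, $X_1^{(t+1)} \sim f(x_1 \mid x_2^{(t)}, \ldots, x_p^{(t)})$, $X_2^{(t+1)} \sim f(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_p^{(t)})$, ..., $X_p^{(t+1)} \sim f(x_p \mid x_1^{(t+1)}, \ldots, x_{p-1}^{(t+1)})$. 3. Increment $t$ and go to step 2.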
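
A minimal sketch of the algorithm for $p = 2$, assuming a standard bivariate normal target with correlation $\rho$, for which each full conditional is $N(\rho \cdot \text{other}, 1 - \rho^2)$. The target and the function name are illustrative assumptions.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)   # conditional standard deviation
    x1, x2 = 0.0, 0.0                 # starting values x^(0)
    draws = []
    for _ in range(n_steps):
        x1 = rng.gauss(rho * x2, sd)  # X1 | X2 = x2, using the current x2
        x2 = rng.gauss(rho * x1, sd)  # X2 | X1 = x1, using the updated x1
        draws.append((x1, x2))
    return draws

draws = gibbs_bivariate_normal(rho=0.6, n_steps=20000)
```

Each coordinate update conditions on the most recent value of the other coordinate, which is what "in turn" means in step 2.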

106 BOOTSTRAPPING

107 Feature Estimation. Consider a cumulative distribution function $F$. We are interested in a feature of $F$ expressed as a function $\theta = T(F)$ of $F$. Example: $T(F) = \int z\, dF(z)$ is the mean of $F$. Suppose we have random variables $X_1, \ldots, X_n \sim$ i.i.d. $F$ (denoted as $X \sim F$), and $x_1, \ldots, x_n$ are realizations of $X_1, \ldots, X_n$. If $\hat{F}$ is the empirical distribution function of the observed data, then $\hat{\theta} = T(\hat{F})$ is an estimate of $\theta$.

108 Unknown Distribution Function. When $F$ is unknown, we are still interested in $T(F)$ or $R(X, F)$, where $X$ is the set of observed data. Example: $R(X, F) = [T(\hat{F}) - T(F)] / S(\hat{F})$, where $S(\hat{F})$ is the estimated standard deviation of $T(\hat{F})$. Bootstrap: approximate the distribution of $R(X, F)$ using the observed data (an estimate of $F$).

109 Bootstrap Sample of Pseudo-data. $X^* = \{X_1^*, \ldots, X_n^*\}$, also called a pseudo-dataset. Each $X_i^*$ is an i.i.d. random variable with distribution $\hat{F}$.

110 A Simple Example. Let $\{x_1, x_2, x_3\} = \{1, 2, 6\}$ be an i.i.d. sample from a distribution $F$ that has mean $\theta$. We want to estimate $\theta$ by the sample mean $\hat{\theta}$. Let $X^*$ consist of 3 elements drawn i.i.d. from $\hat{F}$; there are $3^3 = 27$ possible outcomes for $X^*$. Let $\hat{F}^*$ be the empirical distribution function of such a sample and $\hat{\theta}^* = T(\hat{F}^*)$. There are only 10 possible outcomes of $\hat{\theta}^*$.

111 Possible Outcomes of $\hat{\theta}^*$

112 The Bootstrap Principle. Approximate the distribution of $R(X, F)$ using that of $R(X^*, \hat{F})$. Example: a 25/27 (about 93%) confidence interval for $\theta$ is (4/3, 14/3), using quantiles of the distribution of $\hat{\theta}^*$.
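
The small example can be checked by enumerating all 27 ordered pseudo-datasets exactly. The variable names are illustrative; exact fractions are used so that the counts quoted on the slides can be verified without rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

data = [1, 2, 6]
n = len(data)

# Exact distribution of the bootstrap mean over all n**n ordered pseudo-datasets.
dist = Counter()
for sample in product(data, repeat=n):
    dist[Fraction(sum(sample), n)] += Fraction(1, n ** n)

# Probability mass falling inside the interval [4/3, 14/3].
coverage = sum(p for mean, p in dist.items()
               if Fraction(4, 3) <= mean <= Fraction(14, 3))
```

Only the two extreme pseudo-datasets, {1, 1, 1} and {6, 6, 6}, fall outside the interval, which is where the 25/27 comes from.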

113 Non-parametric Bootstrap. If the sample size is big, the number of potential bootstrap pseudo-datasets is very large, and it is impossible to enumerate all of them. Draw $B$ independent random bootstrap pseudo-datasets $X_i^*$ for $i = 1, \ldots, B$. Approximate the distribution of $R(X, F)$ using the empirical distribution of $R(X_i^*, \hat{F})$ for $i = 1, \ldots, B$. The simulation error can be made arbitrarily small by increasing $B$.
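
A minimal sketch of drawing $B$ pseudo-datasets and recording the sample mean of each. The function name `bootstrap_means` and the choice of the mean as the statistic are assumptions for illustration.

```python
import random

def bootstrap_means(data, B, seed=0):
    """Sample means of B pseudo-datasets drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(B)]

means = bootstrap_means([1, 2, 6], B=20000)
average = sum(means) / len(means)   # should be close to the sample mean, 3
```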

114 A Simple Example. Draw with replacement from {1, 2, 6} with equal probability. Each bootstrap pseudo-dataset produces a corresponding estimate $\hat{\theta}^*$.

115 Bootstrap Bias Correction. If we set $R(X, F) = \hat{\theta} - \theta$, its mean is the bias of $\hat{\theta}$, which is estimated by the bootstrap using the mean of $R(X^*, \hat{F}) = \hat{\theta}^* - \hat{\theta}$.

116 Ensemble Classifiers. $C^*(x) = Vote(C_1(x), \ldots, C_k(x))$. Figure from [Tan, Steinbach, Kumar].

117 Why May Ensemble Method Work? Suppose there are two classes and each base classifier has an error rate of $\epsilon = 35\%$. What if we use 25 base classifiers? If all base classifiers are identical, the ensemble error rate is still 35%. If the base classifiers are independent, the ensemble makes a wrong prediction only if more than half of the base classifiers are wrong: $\sum_{i=13}^{25} \binom{25}{i} \epsilon^i (1 - \epsilon)^{25-i} = 0.06$.
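
The binomial tail sum can be evaluated directly. The function name `ensemble_error` is hypothetical, and the sketch assumes independent base classifiers with a common error rate, as on the slide.

```python
from math import comb

def ensemble_error(n_classifiers, eps):
    """Probability that more than half of n independent base classifiers,
    each wrong with probability eps, are wrong at the same time."""
    majority = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, i) * eps ** i * (1 - eps) ** (n_classifiers - i)
               for i in range(majority, n_classifiers + 1))

error_25 = ensemble_error(25, 0.35)
```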

118 Ensemble Error Rate. Figure from [Tan, Steinbach, Kumar].

119 Ensemble Classifiers: When? The base classifiers should be independent of each other. Each base classifier should do better than a classifier that performs random guessing.

120 How to Construct Ensemble? Manipulating the training set: derive multiple training sets and build a base classifier on each. Manipulating the input features: use only a subset of the features in each base classifier. Manipulating the class labels: if there are many classes, for each base classifier randomly divide the classes into two subsets A and B; for a test case, if a base classifier predicts its class as A, all classes in A receive a vote. Manipulating the learning algorithm, e.g., using different network configurations in an ANN.

121 Bootstrap. Given an original training set $T$, derive a training set $T'$ by repeatedly and uniformly sampling with replacement. If $T$ has $n$ tuples and $T'$ is built from $n$ draws, each tuple has a probability $p = 1 - (1 - 1/n)^n$ of being selected in $T'$. When $n \to \infty$, $p \to 1 - 1/e \approx 0.632$. Use the tuples not in $T'$ as the test set.
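
The limit can be checked numerically. The function name `inclusion_probability` is hypothetical, and the sketch assumes $n$ draws with replacement from $n$ tuples.

```python
import math

def inclusion_probability(n):
    """Probability that a given tuple appears at least once among n draws
    made uniformly with replacement from n tuples."""
    return 1.0 - (1.0 - 1.0 / n) ** n

limit = 1.0 - math.exp(-1.0)  # about 0.632
```

Already at $n = 100$ the probability is within about 0.002 of the limit, which is why 0.632 is used as a fixed constant.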

122 Bagging. Run bootstrap $k$ times to obtain $k$ base classifiers. A test instance is assigned to the class that receives the highest number of votes. Strength: reduces the variance of base classifiers, which is good for unstable base classifiers. Unstable classifiers are sensitive to minor perturbations in the training set, e.g., decision trees, associative classifiers, and ANN. For stable classifiers (e.g., linear discriminant analysis and k-NN classifiers), bagging may even degrade the performance, since each bootstrap training set contains fewer distinct tuples. Less overfitting on noisy data.

123 Bootstrap in Testing. Use a bootstrap sample as the training set, and use the tuples not in the training set as the test set. The .632 bootstrap: compute the overall accuracy by combining the accuracy of each bootstrap sample, $\epsilon_i$, with the accuracy $acc_{all}$ computed from a classifier using the whole data set as the training set: $acc_{.632\,bootstrap} = \frac{1}{k} \sum_{i=1}^{k} (0.632\, \epsilon_i + 0.368\, acc_{all})$.

124 NONPARAMETRIC DENSITY ESTIMATION

125 Density Estimation Using observations of random variables X 1, X 2,, X n sampled independently from a density function f, estimate f Why density estimation? Assess multimodality, skew, tail behavior, Decision making, classification, summarize Bayesian posteriors A useful presentation tool to summarize a distribution A tool in some other methods, such as MCMC J. Pei: Computational Statistics 125

126 General Idea Assume a parametric model X 1, X 2,, X n ~ iid f X θ, where θ is a very low-dimensional parameter vector Estimate parameter using some estimation paradigm, such as maximum likelihood, Bayesian, or method-of-moments estimation The resulting density estimate at x is If the assumed model f X θ is incorrect, the inferential error is big J. Pei: Computational Statistics 126

127 Moments When a set of points representing a probability density 0 th moment: total of probability, i.e., 1 1 st moment: mean 2 nd moment: variance 3 rd moment: skewness Normalized moment: J. Pei: Computational Statistics 127

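128 Method of Moments
Idea: use relations between population moments and parameters of interest.
To estimate k unknown parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ of a distribution $f_X(x \mid \theta)$, suppose the first k moments of the true distribution can be expressed as functions of $\theta$:
$$\mu_i = E[X^i] = g_i(\theta_1, \theta_2, \ldots, \theta_k), \quad i = 1, \ldots, k$$
Draw a sample of n units $x_1, x_2, \ldots, x_n$; then the i-th sample moment is
$$\hat\mu_i = \frac{1}{n} \sum_{j=1}^{n} x_j^i$$
which is an unbiased estimator of the population moment $\mu_i$.
Solve the equations $\hat\mu_i = g_i(\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$, $i = 1, \ldots, k$, for $\hat\theta$.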
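As a worked instance of the method of moments, the sketch below fits a gamma distribution with shape a and scale s, for which the first two raw moments are a·s and a·s² + (a·s)². The gamma example and the function names are assumptions chosen for illustration; they are not taken from the slides.

```python
import random

def sample_moment(xs, i):
    """i-th raw sample moment: (1/n) * sum of x_j ** i."""
    return sum(x ** i for x in xs) / len(xs)

def gamma_method_of_moments(xs):
    """Solve mu_1 = a*s and mu_2 = a*s^2 + (a*s)^2 for shape a and scale s."""
    m1, m2 = sample_moment(xs, 1), sample_moment(xs, 2)
    var = m2 - m1 ** 2          # equals a * s^2 at the population level
    return m1 ** 2 / var, var / m1

rng = random.Random(42)
# random.gammavariate(alpha, beta) uses beta as the scale, so the mean is alpha * beta.
xs = [rng.gammavariate(2.0, 3.0) for _ in range(20000)]
shape_hat, scale_hat = gamma_method_of_moments(xs)
print(shape_hat, scale_hat)
```

With two unknown parameters, two moment equations suffice; the estimates should land near the true shape 2 and scale 3 used to simulate the data.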

129 Nonparametric Density Estimation
Assume very little about the form of $f$.
Use mainly local information to estimate $f$ at a point $x$.

130 Motivation
If $f$ is smooth enough and we observe a point $X_i = x_i$, we assume $f$ assigns some density not only at $x_i$ but also in a region around $x_i$.
To estimate $f$ from $X_1, X_2, \ldots, X_n \sim \text{iid } f$, we accumulate localized probability density contributions in regions around each $X_i$.

131 h-neighborhood
To estimate the density at a point $x$, consider a region of width $dx = 2h$ centered at $x$, where $h$ is a parameter to be set.
The proportion of the observations that fall in the interval $\gamma = [x - h, x + h]$, divided by the width $2h$, approximates the density at $x$.
Take $p(\gamma) = \int_{x-h}^{x+h} f(t)\,dt \approx 2h f(x)$, then
$$\hat f(x) = \frac{1}{2hn} \sum_{i=1}^{n} 1_{\{|x - X_i| < h\}}$$
Here, $1_{\{A\}} = 1$ if A is true, otherwise 0.

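132 Estimation
$N_\gamma = \sum_{i=1}^{n} 1_{\{|x - X_i| < h\}}$ is the number of sample points in the interval $\gamma$.
$N_\gamma$ is a $\text{Bin}(n, p(\gamma))$ random variable, where $p(\gamma) = \int_{x-h}^{x+h} f(t)\,dt$.
$E[N_\gamma / n] = p(\gamma)$ and $\text{var}[N_\gamma / n] = p(\gamma)(1 - p(\gamma)) / n$.
To obtain accurate estimation, we need $h \to 0$ and $n \to \infty$, since $\text{var}[N_\gamma / n] \to 0$ when $n \to \infty$.
Because $\hat f(x) = N_\gamma / (2hn)$, its variance is approximately $f(x) / (2nh)$, which vanishes only if $nh \to \infty$.
Thus, we should require $nh \to \infty$ and $h \to 0$ as $n \to \infty$.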
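The h-neighborhood estimator amounts to counting the sample points within distance h of x and dividing by 2hn. A minimal sketch, with a hypothetical function name and a synthetic uniform sample:

```python
import random

def naive_density(x, sample, h):
    """Count the points with |x - X_i| < h and divide by 2 * h * n."""
    count = sum(1 for xi in sample if abs(x - xi) < h)
    return count / (2 * h * len(sample))

rng = random.Random(0)
sample = [rng.random() for _ in range(5000)]  # Uniform(0, 1), true density 1 on (0, 1)
est = naive_density(0.5, sample, h=0.1)
print(est)
```

Shrinking h for a fixed n leaves fewer points in the window and makes the estimate noisier, which is the reason for requiring nh to grow.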

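133 Performance Measures
Integrated squared error (ISE):
$$\mathrm{ISE}(h) = \int \left( \hat f(x) - f(x) \right)^2 dx$$
Mean integrated squared error (MISE): $\mathrm{MISE}(h) = E\{\mathrm{ISE}(h)\}$.
$\mathrm{MISE}(h)$ is the accumulation of the local mean squared error at each $x$:
$$\mathrm{MISE}(h) = \int \mathrm{MSE}_h(\hat f(x))\,dx, \quad \mathrm{MSE}_h(\hat f(x)) = E\left\{ \left( \hat f(x) - f(x) \right)^2 \right\}$$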

134 Kernel Functions
Recall $\hat f(x) = \frac{1}{2hn} \sum_{i=1}^{n} 1_{\{|x - X_i| < h\}}$.
Every point in the h-neighborhood is weighted the same.
Intuition: closer points should be weighted more.
Idea: use a kernel function.
A kernel function $K(\cdot)$ is a non-negative real-valued integrable function such that $\int_{-\infty}^{\infty} K(u)\,du = 1$ and $K(-u) = K(u)$ for any $u$.

135 Kernel Density Estimation
Using a kernel function $K$, estimate $f$ by
$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - X_i}{h} \right)$$
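A minimal sketch of the kernel density estimator, assuming a standard normal kernel; the function names, the sample, and the bandwidth 0.4 are illustrative choices.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density: non-negative, symmetric, integrates to 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Kernel density estimate: (1 / (n h)) * sum of K((x - X_i) / h)."""
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]
grid = [-8.0 + 0.01 * k for k in range(1601)]
area = sum(kde(x, sample, 0.4) for x in grid) * 0.01  # Riemann sum over [-8, 8]
print(area, kde(0.0, sample, 0.4))
```

Because the kernel integrates to 1, the estimate is itself a density, which the Riemann sum confirms numerically.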

136 Bandwidth
Parameter $h$ is called the (fixed) bandwidth.
The bandwidth has a strong influence on the estimator.
If $h$ is too small, the estimator tends to assign probability density too locally near the observed data, resulting in a very wiggly estimated density function with many false modes.
If $h$ is too large, the estimator spreads probability density contributions too diffusely; averaging over neighborhoods that are too large smooths away important features of $f$.
The bandwidth controls the trade-off between the bias and the variance of the estimator.

137 Example

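138 Roughness
Measure the roughness of a function $g$ by
$$R(g) = \int g^2(z)\,dz$$
Assume $R(K) < \infty$, $f'$ and $f''$ are bounded and continuous, and $R(f'') < \infty$.
Recall $\mathrm{MISE}(h) = \int \mathrm{MSE}_h(\hat f(x))\,dx$. Thus,
$$\mathrm{MSE}_h(\hat f(x)) = \mathrm{var}\{\hat f(x)\} + \left( \mathrm{bias}\{\hat f(x)\} \right)^2, \quad \mathrm{bias}\{\hat f(x)\} = E\{\hat f(x)\} - f(x)$$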

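139 Computing the Bias
$$E\{\hat f(x)\} = \int \frac{1}{h} K\left( \frac{x - u}{h} \right) f(u)\,du = \int K(t) f(x - ht)\,dt$$
Using the Taylor series expansion,
$$f(x - ht) = f(x) - ht f'(x) + \frac{h^2 t^2}{2} f''(x) + o(h^2)$$
Since $K$ is symmetric about 0, $\int t K(t)\,dt = 0$, so
$$E\{\hat f(x)\} = f(x) + \frac{h^2 \sigma_K^2 f''(x)}{2} + o(h^2)$$
where $\sigma_K^2 = \int t^2 K(t)\,dt$ is the variance of $K$. Thus,
$$\int \left( \mathrm{bias}\{\hat f(x)\} \right)^2 dx = \frac{h^4 \sigma_K^4 R(f'')}{4} + o(h^4)$$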

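140 Computing the Variance
Using a similar strategy,
$$\mathrm{var}\{\hat f(x)\} = \frac{1}{nh} f(x) R(K) + o\left( \frac{1}{nh} \right), \quad \int \mathrm{var}\{\hat f(x)\}\,dx = \frac{R(K)}{nh} + o\left( \frac{1}{nh} \right)$$
Thus,
$$\mathrm{MISE}(h) = \mathrm{AMISE}(h) + o\left( \frac{1}{nh} + h^4 \right)$$
where the asymptotic mean integrated squared error is
$$\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4 \sigma_K^4 R(f'')}{4}$$
with $\sigma_K^2 = \int t^2 K(t)\,dt$.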

141 Optimal Bandwidth
Set $h$ at an intermediate value that avoids excessive bias and excessive variability.
Theoretically,
$$h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$$
is the optimal bandwidth, i.e., the minimizer of $\mathrm{AMISE}(h)$, where $\sigma_K^2 = \int t^2 K(t)\,dt$.
It is not practically useful since it depends on $f$, which is to be estimated.
Turn to some heuristic methods.
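The optimal bandwidth follows from setting the derivative of the asymptotic mean integrated squared error to zero. The short derivation below is a sketch that assumes the standard form of AMISE for a symmetric kernel with variance $\sigma_K^2 = \int t^2 K(t)\,dt$ and roughness $R(g) = \int g^2(z)\,dz$.

```latex
\begin{align*}
\mathrm{AMISE}(h) &= \frac{R(K)}{nh} + \frac{h^4 \sigma_K^4 R(f'')}{4}, \\
\frac{d}{dh}\,\mathrm{AMISE}(h) &= -\frac{R(K)}{nh^2} + h^3 \sigma_K^4 R(f'') = 0
  \;\Longrightarrow\; h^5 = \frac{R(K)}{n \sigma_K^4 R(f'')}, \\
h_{\mathrm{opt}} &= \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}, \qquad
\mathrm{AMISE}(h_{\mathrm{opt}}) = \frac{5}{4} \left[ \sigma_K R(K) \right]^{4/5} R(f'')^{1/5} n^{-4/5}.
\end{align*}
```

The variance term falls and the squared-bias term rises as h grows, so the minimizer balances the two; the resulting error decays at rate $n^{-4/5}$.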

142 Cross-Validation Ideas
Relate $h$ to some quantified quality measure $Q(h)$ of $\hat f$ as an estimator of $f$.
Optimize $Q(h)$ and find the optimal $h$.
Problem: we use the observed data to calculate $\hat f$ and use the same data to evaluate the quality of $\hat f$, which may lead to overfitting.
Remedy: use some of the data to learn a model and the rest of the data to evaluate it.

143 Cross-Validation
To evaluate the quality of $\hat f$ at the i-th data point, use all the data except for the i-th data point to train the model.
Denote the estimated density at $X_i$ using a kernel density estimator with all the observations except for $X_i$ by
$$\hat f_{-i}(X_i) = \frac{1}{h(n-1)} \sum_{j \ne i} K\left( \frac{X_i - X_j}{h} \right)$$
Set $Q(h)$ to a function of $\hat f_{-1}(X_1), \ldots, \hat f_{-n}(X_n)$.
The bandwidth estimated this way can be highly sensitive to sampling variability.

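144 Using Pseudo-Likelihood
Set $Q(h)$ to the pseudo-likelihood (PL):
$$\mathrm{PL}(h) = \prod_{i=1}^{n} \hat f_{-i}(X_i)$$
where $\hat f_{-i}(X_i)$ is the kernel density estimate at $X_i$ computed from all observations except $X_i$.
Optimize $\mathrm{PL}(h)$ and find the optimal $h$.
Simple and intuitively appealing.
The resulting density estimate is often too wiggly and sensitive to outliers.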
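A minimal sketch of bandwidth selection by pseudo-likelihood, assuming a normal kernel and a simple grid search. The log of PL(h) is maximized instead of PL(h) itself to avoid numerical underflow; the grid, sample, and function names are illustrative assumptions.

```python
import math
import random

def phi(u):
    """Standard normal density, used here as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def log_pseudo_likelihood(sample, h):
    """Sum over i of log f_hat_{-i}(X_i), the leave-one-out density at X_i."""
    n = len(sample)
    total = 0.0
    for i, xi in enumerate(sample):
        dens = sum(phi((xi - xj) / h)
                   for j, xj in enumerate(sample) if j != i) / ((n - 1) * h)
        if dens == 0.0:  # kernel underflow for an isolated point
            return float("-inf")
        total += math.log(dens)
    return total

rng = random.Random(3)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.05 * k for k in range(1, 41)]  # candidate bandwidths 0.05, ..., 2.0
h_pl = max(grid, key=lambda h: log_pseudo_likelihood(sample, h))
print(h_pl)
```

Leaving out $X_i$ matters: without it, the product would grow without bound as h shrinks, because each point would sit under its own kernel spike.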

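145 Unbiased Cross-Validation Criterion
Rewrite the integrated squared error as
$$\mathrm{ISE}(h) = \int \hat f^2(x)\,dx - 2 E\{\hat f(X)\} + \int f^2(x)\,dx = R(\hat f) - 2 E\{\hat f(X)\} + R(f)$$
The last term is constant.
The second term can be estimated by $\frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i)$, where $\hat f_{-i}(X_i)$ is the leave-one-out kernel density estimate at $X_i$.
Thus, minimizing
$$\mathrm{UCV}(h) = R(\hat f) - \frac{2}{n} \sum_{i=1}^{n} \hat f_{-i}(X_i)$$
with respect to $h$ can find a good $h$.
It is called the unbiased cross-validation criterion because $E\{\mathrm{UCV}(h) + R(f)\} = \mathrm{MISE}(h)$.
It is also known as least squares cross-validation because choosing $h$ to minimize $\mathrm{UCV}(h)$ minimizes the integrated squared error between $\hat f$ and $f$ with respect to $h$.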

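146 Using a Normal Kernel
If analytic evaluation of $R(\hat f)$ is infeasible, use a different kernel that permits an analytic simplification.
If a normal kernel $\phi$ is used,
$$\mathrm{UCV}(h) = \frac{1}{2nh\sqrt{\pi}} + \frac{1}{n^2 h} \sum_{i \ne j} \frac{1}{2\sqrt{\pi}} \exp\left\{ -\frac{(X_i - X_j)^2}{4h^2} \right\} - \frac{2}{n(n-1)h} \sum_{i \ne j} \phi\left( \frac{X_i - X_j}{h} \right)$$
Slow convergence to the optimum.
Strong dependence on the observed data: when applied to different data sets drawn from the same distribution, it may yield very different answers.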
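With a normal kernel, $R(\hat f)$ has a closed form because the product of two normal kernels integrates to an N(0, 2) density evaluated at the scaled difference. The sketch below evaluates $\mathrm{UCV}(h) = R(\hat f) - \frac{2}{n}\sum_i \hat f_{-i}(X_i)$ this way and searches a grid; the function name, grid, and sample are assumptions made for illustration.

```python
import math
import random

def ucv_normal(sample, h):
    """UCV(h) = R(f_hat) - (2/n) * sum_i f_hat_{-i}(X_i) for a normal kernel."""
    n = len(sample)
    rough = 0.0  # sum over i != j of the N(0, 2) density at (X_i - X_j) / h
    loo = 0.0    # sum over i != j of phi((X_i - X_j) / h)
    for i in range(n):
        for j in range(n):
            if i != j:
                d = (sample[i] - sample[j]) / h
                rough += math.exp(-d * d / 4.0) / (2.0 * math.sqrt(math.pi))
                loo += math.exp(-d * d / 2.0) / math.sqrt(2.0 * math.pi)
    # The diagonal (i == j) terms of R(f_hat) give 1 / (2 n h sqrt(pi)).
    r_hat = 1.0 / (2.0 * n * h * math.sqrt(math.pi)) + rough / (n * n * h)
    return r_hat - 2.0 * loo / (n * (n - 1) * h)

rng = random.Random(5)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.05 * k for k in range(1, 41)]  # candidate bandwidths 0.05, ..., 2.0
h_ucv = min(grid, key=lambda h: ucv_normal(sample, h))
print(h_ucv)
```

Re-running the search with different seeds gives a feel for the strong dependence on the observed data noted on the slide.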

147 Plug-in Methods
Apply a pilot bandwidth to estimate one or more important features of $f$.
The bandwidth for estimating $f$ itself is then estimated at a second stage using a criterion depending on the estimated features.
Theoretically,
$$h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$$
is the optimal bandwidth, where $\sigma_K^2 = \int t^2 K(t)\,dt$.
Key: estimate $R(f'')$.
Very effective in many applications.

148 Silverman's Rule of Thumb
Replace $f$ by a normal density with variance set to match the sample variance.
Equivalently, estimate $R(f'')$ by $R(\phi'') / \hat\sigma^5$, where $\phi$ is the standard normal density function and $\hat\sigma^2$ is the sample variance.
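For a normal kernel, plugging $R(\phi'') / \hat\sigma^5$ with $R(\phi'') = 3 / (8\sqrt{\pi})$ into the optimal-bandwidth formula gives $h = (4 / (3n))^{1/5} \hat\sigma$. The sketch below checks that the two routes agree; the function names and the sample are assumptions for illustration.

```python
import math
import random
import statistics

def amise_bandwidth(n, rough_k, var_k, rough_f2):
    """h = (R(K) / (n * sigma_K^4 * R(f'')))^(1/5), with var_k = sigma_K^2."""
    return (rough_k / (n * var_k ** 2 * rough_f2)) ** 0.2

def silverman_bandwidth(sample):
    """Rule of thumb for a normal kernel: h = (4 / (3 n))^(1/5) * sigma_hat."""
    return (4.0 / (3.0 * len(sample))) ** 0.2 * statistics.stdev(sample)

rng = random.Random(7)
sample = [rng.gauss(10.0, 2.0) for _ in range(500)]
sigma_hat = statistics.stdev(sample)

rough_phi = 1.0 / (2.0 * math.sqrt(math.pi))    # R(phi) for the normal kernel
rough_phi2 = 3.0 / (8.0 * math.sqrt(math.pi))   # R(phi'') for the standard normal
h_plugin = amise_bandwidth(len(sample), rough_phi, 1.0, rough_phi2 / sigma_hat ** 5)
h_rule = silverman_bandwidth(sample)
print(h_plugin, h_rule)
```

The rule works well when f is close to normal and tends to oversmooth multimodal densities, since a normal reference has a comparatively small $R(f'')$.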

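149 Empirical Estimation of R(f'')
Empirical estimation of $R(f'')$ in $h = \left( R(K) / (n \sigma_K^4 R(f'')) \right)^{1/5}$.
The kernel-based estimator is
$$\hat f''(x) = \frac{d^2}{dx^2} \left\{ \frac{1}{n h_0} \sum_{i=1}^{n} L\left( \frac{x - X_i}{h_0} \right) \right\} = \frac{1}{n h_0^3} \sum_{i=1}^{n} L''\left( \frac{x - X_i}{h_0} \right)$$
$h_0$ is the bandwidth and $L$ is a sufficiently differentiable kernel used to estimate $f''$.
The best bandwidth for estimating $f$ differs from the best bandwidth for estimating $f''$ or $R(f'')$.
A larger bandwidth is required for estimating $f''$, that is, $h_0 > h$.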

150 The Sheather-Jones Method
A two-stage process:
A simple rule of thumb is used to calculate the bandwidth $h_0$.
Use $h_0$ to estimate $R(f'')$ by $\hat R(f'')$.
Compute $h$ using
$$h = \left( \frac{R(K)}{n \sigma_K^4 \hat R(f'')} \right)^{1/5}$$

151 An Implementation
For univariate kernel density estimation with pilot kernel $L = \phi$, the Sheather-Jones bandwidth is the value of $h$ that solves the equation
$$\left( \frac{R(K)}{n \sigma_K^4 \hat R_{\hat\alpha(h)}(f'')} \right)^{1/5} - h = 0$$

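152 An Implementation (Cont'd)
Here,
$$\hat R_{\hat\alpha(h)}(f'') = \frac{1}{n(n-1)\alpha^5} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi^{(4)}\left( \frac{X_i - X_j}{\alpha} \right), \quad \alpha = \hat\alpha(h) = \left( \frac{6\sqrt{2}\, h^5 \hat R_a(f'')}{\hat R_b(f''')} \right)^{1/7}$$
$$\hat R_a(f'') = \frac{1}{n(n-1)a^5} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi^{(4)}\left( \frac{X_i - X_j}{a} \right), \quad \hat R_b(f''') = -\frac{1}{n(n-1)b^7} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi^{(6)}\left( \frac{X_i - X_j}{b} \right)$$
$$a = \frac{0.920\,\mathrm{IQR}}{n^{1/7}}, \quad b = \frac{0.912\,\mathrm{IQR}}{n^{1/9}}$$
where $\phi^{(k)}$ is the k-th derivative of the standard normal density and IQR is the interquartile range of the data.
Solve for $h$ using, for example, Newton's method.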

153 Interquartile Range (IQR)
A measure of statistical dispersion, also known as the midspread or middle fifty.
The difference between the upper and lower quartiles: $\mathrm{IQR} = Q_3 - Q_1$.
Example shown on the slide: IQR = 10.

154 Maximal Smoothing Principle
The bandwidth should be selected to discourage false modes, producing an estimate that shows modes only where the data indisputably require them.
Replace $R(f'')$ with the most conservative (i.e., smallest) possible value.
Consider all $h$ that would minimize $\mathrm{AMISE}(h)$ for some admissible $f$, and select the largest $h$.
The right-hand side of
$$h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}$$
should be maximized with respect to $f$.

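155 Implementation
Set
$$h = 3 \hat\sigma \left( \frac{R(K)}{35\, n\, \sigma_K^4} \right)^{1/5}$$
where $\hat\sigma$ is the sample standard deviation; for a kernel scaled to unit variance, $\sigma_K = 1$.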
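A minimal sketch of the maximal smoothing bandwidth, assuming the form $h = 3\hat\sigma \left( R(K) / (35 n \sigma_K^4) \right)^{1/5}$ and a normal kernel as the default; the function names and sample are illustrative. Silverman's rule of thumb is included only for comparison.

```python
import math
import random
import statistics

def maximal_smoothing_bandwidth(sample, rough_k=1.0 / (2.0 * math.sqrt(math.pi)), var_k=1.0):
    """h = 3 * sigma_hat * (R(K) / (35 * n * sigma_K^4))^(1/5); defaults: normal kernel."""
    n = len(sample)
    return 3.0 * statistics.stdev(sample) * (rough_k / (35.0 * n * var_k ** 2)) ** 0.2

def silverman_bandwidth(sample):
    """Normal-reference rule for a normal kernel, for comparison only."""
    return (4.0 / (3.0 * len(sample))) ** 0.2 * statistics.stdev(sample)

rng = random.Random(11)
sample = [rng.gauss(0.0, 1.0) for _ in range(300)]
h_ms = maximal_smoothing_bandwidth(sample)
h_rule = silverman_bandwidth(sample)
print(h_ms, h_rule)
```

For a normal kernel the maximal smoothing bandwidth is about 8% larger than the normal-reference value at any n, which is consistent with choosing the largest bandwidth that could be AMISE-optimal.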


More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

Stat 451 Lecture Notes Monte Carlo Integration

Stat 451 Lecture Notes Monte Carlo Integration Stat 451 Lecture Notes 06 12 Monte Carlo Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 6 in Givens & Hoeting, Chapter 23 in Lange, and Chapters 3 4 in Robert & Casella 2 Updated:

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Part III. A Decision-Theoretic Approach and Bayesian testing

Part III. A Decision-Theoretic Approach and Bayesian testing Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to

More information

Robert Collins CSE586, PSU Intro to Sampling Methods

Robert Collins CSE586, PSU Intro to Sampling Methods Robert Collins Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Robert Collins A Brief Overview of Sampling Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests

More information

Extreme Value Analysis and Spatial Extremes

Extreme Value Analysis and Spatial Extremes Extreme Value Analysis and Department of Statistics Purdue University 11/07/2013 Outline Motivation 1 Motivation 2 Extreme Value Theorem and 3 Bayesian Hierarchical Models Copula Models Max-stable Models

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Robert Collins CSE586, PSU Intro to Sampling Methods

Robert Collins CSE586, PSU Intro to Sampling Methods Intro to Sampling Methods CSE586 Computer Vision II Penn State Univ Topics to be Covered Monte Carlo Integration Sampling and Expected Values Inverse Transform Sampling (CDF) Ancestral Sampling Rejection

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information