ADVANCED MACHINE LEARNING FALL 2017 VERSION: OCTOBER 12, 2017

1 ADVANCED MACHINE LEARNING FALL 2017 VERSION: OCTOBER 12, 2017

2 COURSE ADMIN

3 TERM TIMELINE First class: September 5/6. Final class: December 7/11. Instructors: Peter Orbanz, John Cunningham. Dates: Python tutorial September 12/13; Midterm exam October 19/23; TensorFlow tutorial October 24/25; Final project due December 11. Advanced Machine Learning 3 / 212

4 ASSISTANTS AND GRADING Teaching Assistants: Ian Kinsella and Wenda Zhou. Office Hours: Mon/Tue 5:30-7:30pm, Room 1025, Dept of Statistics, 10th floor SSW. Class Homepage. Homework: Some homework problems and the final project require coding. Coding: Python. Homework due Tue/Wed at 4pm; no late submissions. You can drop two homeworks from your final score. Grade: Homework 20% + Midterm Exam 40% + Final Project 40%. Advanced Machine Learning 4 / 212

5 HOUSE RULES All emails to the TAs, please. The instructors will not read your email unless it is forwarded by a TA. Problems with exam/project: If you cannot take the exam or finish the project, you must let us know at least one week before the midterm exam/project due date. Advanced Machine Learning 5 / 212

6 READING The relevant course material is the slides. Books (optional) See class homepage for references. Advanced Machine Learning 6 / 212

7 OUTLINE (VERY TENTATIVE) Part I (Orbanz) Neural networks (basic definitions and training) Graphical models (ditto) Sampling algorithms Variational inference Optimization for GMs and NNs Part II (Cunningham) NN software Convolutional NNs and computer vision Recurrent NNs Reinforcement learning Dimension reduction and autoencoders Advanced Machine Learning 7 / 212

8 INTRODUCTION

9 AGAIN: MACHINE LEARNING Historical origins: Artificial intelligence and engineering Machines need to... recognize patterns (e.g. vision, language) make decisions based on experience (= data) predict cope with uncertainty Modern applications: (A few) Examples medical diagnosis face detection/recognition speech and handwriting recognition web search recommender systems bioinformatics natural language processing computer vision Today Machine learning and statistics have become hard to tell apart. Advanced Machine Learning 9 / 212

10 LEARNING AND STATISTICS Task Balance the pendulum upright by moving the sled left and right. The computer can control only the motion of the sled. Available data: Current state of system (measured 25 times/second). Advanced Machine Learning 10 / 212

11 LEARNING AND STATISTICS Formalization State = 4 variables (sled location, sled velocity, angle, angular velocity) Actions = sled movements The system can be described by a function f : S × A → S, (state, action) ↦ state Advanced Machine Learning 10 / 212

12 LEARNING AND STATISTICS Advanced Machine Learning 11 / 212

13 LEARNING AND STATISTICS After each run Fit a function f : S × A → S, (state, action) ↦ state, to the data obtained in previous runs. Running the system involves: 1. The function f, which tells the system how the world works. 2. An optimization method that uses f to determine how to move towards the optimal state. Note well Learning how the world works is a regression problem. Advanced Machine Learning 12 / 212

14 OUR MAIN TOPICS Neural networks: Define functions. Represented by a directed graph. Each vertex represents a function; incoming edges are function arguments, outgoing edges are function values. Learning: differentiation/optimization. Graphical models: Define distributions. Represented by a directed graph. Each vertex represents a distribution; incoming edges are conditions, outgoing edges are draws from the distribution. Learning: estimation/inference. Advanced Machine Learning 13 / 212

15 RECALL PREVIOUS TERM [Figure: a linear classifier with regions sgn(⟨v_H, x⟩ − c) > 0 and sgn(⟨v_H, x⟩ − c) < 0.] Supervised learning: problems are classification and regression; solutions are functions. Unsupervised learning: problems are clustering (mixture models), HMMs, dimension reduction (PCA); solutions are distributions. Advanced Machine Learning 14 / 212

16 Neural networks / Graphical models [Figure: left, a neural network with inputs x_1, x_2, x_3, weights v_1, v_2, v_3 and units computing y_j = φ(v_j^t x); right, a layered graphical model with nodes φ and ψ producing y_1, y_2, ...] Advanced Machine Learning 15 / 212

17 NNS AND GRAPHICAL MODELS Neural networks Representation of a function using a graph. Layers x → g → f symbolize f(g(x)): f depends on x only through g. [Figure: one network layer computing y_1 = φ(v_1^t x), y_2 = φ(v_2^t x), y_3 = φ(v_3^t x) from inputs x_1, x_2, x_3 with weights v_jk.] Graphical models Representation of a distribution using a graph. Layers X → Y → Z symbolize p(x, y, z) = p(z | y) p(y | x) p(x): Z is conditionally independent of X given Y. Advanced Machine Learning 16 / 212

18 Grouping dependent variables into layers is a good thing. Advanced Machine Learning 17 / 212

19 HISTORICAL PERSPECTIVE: MCCULLOCH-PITTS NEURON MODEL (1943) A neuron is modeled as a thresholding device that combines input signals. [Figure: inputs x_1, x_2, x_3 feeding a single output y.] McCulloch-Pitts model Collect the input signals x_1, x_2, x_3 into a vector x = (x_1, x_2, x_3) ∈ R^3. Choose a fixed vector v ∈ R^3 and a constant c ∈ R. Compute: y = I{⟨v, x⟩ > c}. Advanced Machine Learning 18 / 212

20 READING THE DIAGRAM [Figure: inputs x_1, x_2, x_3 with weights v_1, v_2, v_3 and a constant input 1 with weight −c, feeding a threshold unit I{· > 0} with output y.] y = I{v_1 x_1 + v_2 x_2 + v_3 x_3 + (−1)c > 0} = I{⟨v, x⟩ > c} Recall: Linear classifier f(x) = sgn(⟨v, x⟩ − c) Advanced Machine Learning 19 / 212

21 LINEAR CLASSIFICATION [Figure: a point x, the normal vector v, and the projection ⟨x, v⟩/‖v‖ onto v, illustrating the decision rule.] f(x) = sgn(⟨v, x⟩ − c) Advanced Machine Learning 20 / 212

22 REMARKS [Figure: inputs x_1, x_2, x_3, weights v_1, v_2, v_3, output y = I{v^t x > c}.] The neural network represents a linear two-class classifier (on R^3). It does not specify the training method. To train the classifier, we need a cost function and an optimization method. Advanced Machine Learning 21 / 212

23 TRAINING For parameter estimation by optimization, we need an optimization target. Idea: Choose 0-1 loss as the simplest loss for classification. Minimize empirical risk (on training data) under this loss. [Figure: the piece-wise constant empirical risk J(a) over the parameter space, with the solution region marked.] Advanced Machine Learning Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley. 22 / 212

24 THE PERCEPTRON CRITERION A piece-wise constant function is not suitable for numerical optimization. Approximate it by a piece-wise linear function: the perceptron cost function J_p(a). [Figure: the piece-wise linear perceptron cost J_p(a) over the parameter space, with the solution region marked.] Advanced Machine Learning Illustration: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley. 23 / 212

25 PERCEPTRON Train the McCulloch-Pitts model (that is: estimate (c, v)) by applying gradient descent to the function C_p(c, v) := ∑_{i=1}^n I{sgn(⟨v, x_i⟩ − c) ≠ ỹ_i} · |⟨(−c, v), (1, x_i)⟩|, called the perceptron cost function. Advanced Machine Learning 24 / 212

26 OUTLINE (VERY TENTATIVE) Part I (Orbanz) Neural networks (basic definitions and training) Graphical models (ditto) Sampling algorithms Variational inference Optimization for GMs and NNs Part II (Cunningham) NN software Convolutional NNs and computer vision Recurrent NNs Reinforcement learning Dimension reduction and autoencoders Advanced Machine Learning 25 / 212

27 TOOLS: LOGISTIC REGRESSION

28 SIGMOIDS Sigmoid function σ(x) = 1 / (1 + e^{−x}) Note 1 − σ(x) = e^{−x} / (1 + e^{−x}) = 1 / (e^{x} + 1) = σ(−x) Derivative dσ/dx (x) = e^{−x} / (1 + e^{−x})^2 = σ(x)(1 − σ(x)) [Figure: the sigmoid (blue) and its derivative (red).] Advanced Machine Learning 27 / 212
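
A quick numerical check of these two identities, as a minimal sketch (the function names and test values are illustrative, not part of the course code):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)

# Identity: 1 - sigma(x) = sigma(-x)
assert np.allclose(1 - sigmoid(x), sigmoid(-x))

# Derivative: sigma'(x) = sigma(x) * (1 - sigma(x)), checked by central finite differences
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
assert np.allclose(numeric, sigmoid(x) * (1 - sigmoid(x)), atol=1e-8)
```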

29 APPROXIMATING DECISION BOUNDARIES In linear classification: The decision boundary is a discontinuity. The boundary is represented either by the indicator function I{· > c} or the sign function sign(· − c). These representations are equivalent: sign(· − c) = 2 I{· > c} − 1. Note The most important use of the sigmoid function in machine learning is as a smooth approximation to the indicator function. Given a sigmoid σ and a data point x, we decide which side of the approximated boundary we are on by thresholding σ(x) at 1/2. Advanced Machine Learning 28 / 212

30 SCALING We can add a scale parameter by defining σ_θ(x) := σ(θx) = 1 / (1 + e^{−θx}) for θ ∈ R. Influence of θ As θ increases, σ_θ approximates I more closely. For θ → ∞, the sigmoid converges to I pointwise, that is: for every x ≠ 0, we have σ_θ(x) → I{x > 0} as θ → +∞. Note σ_θ(0) = 1/2 always, regardless of θ. Advanced Machine Learning 29 / 212

31 APPROXIMATING A LINEAR CLASSIFIER So far, we have considered R, but linear classifiers usually live in R^d. The decision boundary of a linear classifier in R^2 is a discontinuous ridge: this is a linear classifier of the form I{⟨v, x⟩ > c}, here with v = (1, 1) and c = 0. We can stretch σ into a ridge function on R^2: this is the function x = (x_1, x_2) ↦ σ(x_1). The ridge runs parallel to the x_2-axis. If we use σ(x_2) instead, we rotate by 90 degrees (still axis-parallel). Advanced Machine Learning 30 / 212

32 STEERING A SIGMOID Just as for a linear classifier, we use a normal vector v ∈ R^d. The function σ(⟨v, x⟩ − c) is a sigmoid ridge, where the ridge is orthogonal to the normal vector v, and c is an offset that shifts the ridge out of the origin. The plot on the right shows the normal vector (here: v = (1, 1)) in black. The parameters v and c have the same meaning for I and σ, that is, σ(⟨v, x⟩ − c) approximates I{⟨v, x⟩ > c}. Advanced Machine Learning 31 / 212

33 LOGISTIC REGRESSION Logistic regression is a classification method that approximates decision boundaries by sigmoids. Setup Two-class classification problem Observations x_1, ..., x_n ∈ R^d, class labels y_i ∈ {0, 1}. The logistic regression model We model the conditional distribution of the class label given the data as P(y | x) := Bernoulli( σ(⟨v, x⟩ − c) ). Recall σ(⟨v, x⟩ − c) takes values in [0, 1] for all x, and value 1/2 on the class boundary. The logistic regression model interprets this value as the probability that y = 1. Advanced Machine Learning 32 / 212

34 LEARNING LOGISTIC REGRESSION Since the model is defined by a parametric distribution, we can apply maximum likelihood. Notation Recall from Statistical Machine Learning: We collect the parameters in a vector w by writing w := (v, −c) and x̃ := (x, 1), so that ⟨w, x̃⟩ = ⟨v, x⟩ − c. Likelihood function of the logistic regression model ∏_{i=1}^n σ(⟨w, x̃_i⟩)^{y_i} (1 − σ(⟨w, x̃_i⟩))^{1−y_i} Negative log-likelihood L(w) := −∑_{i=1}^n ( y_i log σ(⟨w, x̃_i⟩) + (1 − y_i) log(1 − σ(⟨w, x̃_i⟩)) ) Advanced Machine Learning 33 / 212

35 MAXIMUM LIKELIHOOD ∇L(w) = ∑_{i=1}^n ( σ(w^t x̃_i) − y_i ) x̃_i Note Each training data point x_i contributes to the sum proportionally to the approximation error σ(w^t x̃_i) − y_i incurred at x_i by approximating the linear classifier by a sigmoid. Maximum likelihood The ML estimator ŵ for w is the solution of ∇L(w) = 0. For logistic regression, this equation has no solution in closed form. To find ŵ, we use numerical optimization. The function L is convex (∪-shaped). Advanced Machine Learning 34 / 212
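
A minimal sketch of the negative log-likelihood and its gradient, with plain gradient descent as the numerical method; the synthetic data, step size, and iteration count are illustrative choices, not part of the course material:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    # L(w) = -sum_i [ y_i log sigma(<w, x_i>) + (1 - y_i) log(1 - sigma(<w, x_i>)) ]
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    # grad L(w) = sum_i (sigma(<w, x_i>) - y_i) x_i
    return X.T @ (sigmoid(X @ w) - y)

# Illustrative synthetic data: n points in R^d, augmented with a constant 1 (the x-tilde trick).
rng = np.random.default_rng(0)
n, d = 200, 2
X = np.hstack([rng.normal(size=(n, d)), np.ones((n, 1))])
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Plain gradient descent on the negative log-likelihood.
w = np.zeros(d + 1)
for _ in range(500):
    w -= 0.01 * gradient(w, X, y)
```

On linearly separable data like this synthetic example, the norm of w keeps growing as the optimizer runs longer, which is exactly the overfitting effect discussed a few slides below.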

36 RECALL FROM STATISTICAL MACHINE LEARNING If f is differentiable, we can apply gradient descent: x^{(k+1)} := x^{(k)} − ∇f(x^{(k)}), where x^{(k)} is the candidate solution in step k of the algorithm. If the Hessian matrix H_f of partial second derivatives exists and is invertible, we can apply Newton's method, which converges faster: x^{(k+1)} := x^{(k)} − H_f^{−1}(x^{(k)}) ∇f(x^{(k)}) Recall that the Hessian matrix of a (twice continuously differentiable) function f : R^d → R is H_f(x) := ( ∂^2 f / ∂x_i ∂x_j )_{i,j}. Since f is twice differentiable, each ∂^2 f / ∂x_i ∂x_j exists; since it is twice continuously differentiable, ∂^2 f / ∂x_i ∂x_j = ∂^2 f / ∂x_j ∂x_i, so H_f is symmetric. The inverse of H_f(x) exists if and only if the matrix is positive definite (semidefinite does not suffice), which in turn is true if and only if f is strictly convex. Advanced Machine Learning 35 / 212

37 NEWTON'S METHOD FOR LOGISTIC REGRESSION Applying Newton w^{(k+1)} := w^{(k)} − H_L^{−1}(w^{(k)}) ∇L(w^{(k)}) Matrix notation X̃ is the data matrix (or design matrix) you know from linear regression, with rows x̃_i^t = (x_i^t, 1); it has size n × (d + 1). D_σ is the n × n diagonal matrix with entries σ(w^t x̃_i)(1 − σ(w^t x̃_i)). Newton step w^{(k+1)} = (X̃^t D_σ X̃)^{−1} X̃^t D_σ ( X̃ w^{(k)} − D_σ^{−1} ( σ(⟨w^{(k)}, x̃_1⟩) − y_1, ..., σ(⟨w^{(k)}, x̃_n⟩) − y_n )^t ) Advanced Machine Learning 36 / 212

38 NEWTON'S METHOD FOR LOGISTIC REGRESSION w^{(k+1)} = (X̃^t D_σ X̃)^{−1} X̃^t D_σ ( X̃ w^{(k)} − D_σ^{−1} ( σ(⟨w^{(k)}, x̃_1⟩) − y_1, ..., σ(⟨w^{(k)}, x̃_n⟩) − y_n )^t ) = (X̃^t D_σ X̃)^{−1} X̃^t D_σ u^{(k)}, where u^{(k)} denotes the vector in parentheses. Compare this to the least squares solution of a linear regression problem: β̂ = (X^t X)^{−1} X^t y versus w^{(k+1)} = (X̃^t D_σ X̃)^{−1} X̃^t D_σ u^{(k)}. Differences: The vector y of regression responses is substituted by the vector u^{(k)} above. The matrix X^t X is substituted by the matrix X̃^t D_σ X̃. Note that matrices of product form X^t X are positive semidefinite; since D_σ is diagonal with non-negative entries, so is X̃^t D_σ X̃. Iteratively Reweighted Least Squares At each step, the algorithm solves a least-squares problem reweighted by the matrix D_σ. Since this happens at each step of an iterative algorithm, Newton's method applied to the logistic regression log-likelihood is also known as Iteratively Reweighted Least Squares. Advanced Machine Learning 37 / 212
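
A minimal NumPy sketch of the Newton/IRLS update, assuming the augmented data matrix X̃ and labels y as above; the small jitter term added before solving is a numerical-stability assumption of this sketch, not part of the formula on the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, steps=20, jitter=1e-6):
    """Iteratively Reweighted Least Squares for logistic regression (Newton's method)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)                    # sigma(<w, x_i>) for each i
        D = p * (1 - p)                       # diagonal entries of D_sigma
        H = X.T @ (D[:, None] * X)            # Hessian X^t D_sigma X
        grad = X.T @ (p - y)                  # gradient of the negative log-likelihood
        # Newton step: w <- w - H^{-1} grad (jitter guards against a singular Hessian)
        w -= np.linalg.solve(H + jitter * np.eye(d), grad)
    return w
```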

39 OTHER OPTIMIZATION METHODS Newton: Cost The size of the Hessian is (d + 1) × (d + 1). In high-dimensional problems, inverting H_L can become problematic. Other methods Maximum likelihood only requires that we minimize the negative log-likelihood; we can choose any numerical method, not just Newton. Alternatives include: Pseudo-Newton methods (only invert H_L once, for w^{(1)}, but do not guarantee quadratic convergence). Gradient methods. Approximate gradient methods, like stochastic gradient descent. Advanced Machine Learning 38 / 212

40 OVERFITTING Recall from Statistical Machine Learning If we increase the length of v without changing its direction, the sign of ⟨v, x⟩ does not change, but the value changes. That means: If v is the normal vector of a classifier, and we scale v by some θ > 0, the decision boundary does not move, but ⟨θv, x⟩ = θ⟨v, x⟩. Effect inside a sigmoid σ(⟨θv, x⟩) = σ(θ⟨v, x⟩) = σ_θ(⟨v, x⟩) As the length of v increases, σ(⟨v, x⟩) becomes more similar to I{⟨v, x⟩ > 0}. [Figure: the sigmoid ridge sharpening as v gets longer.] Advanced Machine Learning 39 / 212

41 EFFECT ON ML FOR LOGISTIC REGRESSION Recall each training data point x_i contributes an error term σ(w^t x̃_i) − y_i (scaled by x̃_i) to the gradient of the negative log-likelihood. By increasing the length of w, we can make σ(w^t x̃_i) − y_i arbitrarily small without moving the decision boundary. Advanced Machine Learning 40 / 212

42 OVERFITTING Consequence for linearly separable data Once the decision boundary is correctly located between the two classes, the maximization algorithm can increase the log-likelihood arbitrarily by increasing the length of w. That does not move the decision boundary, but the logistic function looks more and more like the indicator I. That may fit the training data more tightly, but can lead to bad generalization (e.g. for similar reasons as for the perceptron, where the decision boundary may end up very close to a training data point). That is a form of overfitting. Data that is not linearly separable If the data is not separable, sufficiently many points on the wrong side of the decision boundary prevent overfitting (since making w larger increases the error contributions of these points). For large data sets, overfitting can still occur if the fraction of such points is small. Solutions Overfitting can be addressed by including an additive penalty of the form L(w) + λ‖w‖. Advanced Machine Learning 41 / 212

43 LOGISTIC REGRESSION FOR MULTIPLE CLASSES Bernoulli and multinomial distributions The multinomial distribution of N draws from K categories with parameter vector (θ_1, ..., θ_K) (where ∑_{k≤K} θ_k = 1) has probability mass function P(m_1, ..., m_K | θ_1, ..., θ_K) = N! / (m_1! ⋯ m_K!) ∏_{k=1}^K θ_k^{m_k}, where m_k = # draws in category k. Note that Bernoulli(p) = Multinomial(p, 1 − p; N = 1). Logistic regression Recall two-class logistic regression is defined by P(Y | x) = Bernoulli(σ(w^t x)). Idea: To generalize logistic regression to K classes, choose a separate weight vector w_k for each class k, and define P(Y | x) by Multinomial( σ̄(w_1^t x), ..., σ̄(w_K^t x) ), where σ̄(w_k^t x) := σ(w_k^t x) / ∑_{j≤K} σ(w_j^t x). Advanced Machine Learning 42 / 212

44 LOGISTIC REGRESSION FOR MULTIPLE CLASSES Logistic regression for K classes The label y now takes values in {1, ..., K}. P(y | x) = ∏_{k=1}^K σ̄(w_k^t x)^{I{y=k}} The negative log-likelihood becomes L(w_1, ..., w_K) = −∑_{i≤n} ∑_{k≤K} I{y_i = k} log σ̄(w_k^t x_i) This can again be optimized numerically. Comparison to two-class case Recall that 1 − σ(x) = σ(−x). That means Bernoulli( σ(⟨v, x⟩ − c) ) = Multinomial( σ̄(w^t x), σ̄((−1)w^t x) ). That is: Two-class logistic regression as above is equivalent to multiclass logistic regression with K = 2, provided we choose w_2 = −w_1. Advanced Machine Learning 43 / 212
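
A sketch of the class-probability and negative log-likelihood computations as defined on this slide (sigmoids normalized over the K weight vectors); the variable names and the 0-indexed labels are conventions of this sketch, not of the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_probabilities(W, x):
    """P(y = k | x) for each class k, one weight vector per row of W.

    Follows the slide's construction: sigmoids sigma(w_k^t x), normalized to sum to one.
    """
    s = sigmoid(W @ x)          # sigma(w_k^t x) for each class k
    return s / s.sum()

def neg_log_likelihood(W, X, y):
    # L = -sum_i sum_k I{y_i = k} log( normalized sigma(w_k^t x_i) )
    # Labels y are assumed 0-indexed here (class k on the slide corresponds to index k-1).
    return -sum(np.log(class_probabilities(W, x)[label]) for x, label in zip(X, y))
```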

45 GRAPHICAL MODELS

46 GRAPHICAL MODELS A graphical model represents the dependence structure within a set of random variables as a graph. Overview Roughly speaking: Each random variable is represented by a vertex. If Y depends on X, we draw an edge X → Y. For example, the edges Z → X and Z → Y say: X depends on Z, and Y depends on Z. We have to be careful: The above does not imply that X and Y are independent. We have to make more precise what "depends on" means. Advanced Machine Learning 45 / 212

47 We will use the notation: L(X) = distribution of the random variable X, L(X | Y) = conditional distribution of X given Y. (L means law.) Reason If X is discrete, L(X) is usually given by a mass function P(x). If it is continuous, L(X) is usually given by a density p(x). With the notation above, we do not have to distinguish between discrete and continuous variables. Advanced Machine Learning 46 / 212

48 DEPENDENCE AND INDEPENDENCE Recall Dependence between random variables X_1, ..., X_n is a property of their joint distribution L(X_1, ..., X_n). Two random variables are stochastically independent, or independent for short, if their joint distribution factorizes: L(X, Y) = L(X)L(Y). For densities/mass functions: P(x, y) = P(x)P(y) or p(x, y) = p(x)p(y). Dependent means not independent. Intuitively X and Y are dependent if knowing the outcome of X provides any information about the outcome of Y. More precisely: If someone draws (X, Y) simultaneously, and only discloses X = x to you, does that change your mind about the distribution of Y? (If so: dependence.) Once X is given, the distribution of Y is the conditional L(Y | X = x). If that is still L(Y), as before X was drawn, the two are independent. If L(Y | X = x) ≠ L(Y), they are dependent. Advanced Machine Learning 47 / 212

49 CONDITIONAL INDEPENDENCE Definition Given random variables X, Y, Z, we say that X is conditionally independent of Y given Z if L(X, Y | Z = z) = L(X | Z = z) L(Y | Z = z). That is equivalent to L(X | Y = y, Z = z) = L(X | Z = z). Notation X ⊥⊥ Y | Z Intuitively X and Y are dependent given Z = z if, although Z is known, knowing the outcome of X provides additional information about the outcome of Y. Advanced Machine Learning 48 / 212

50 GRAPHICAL MODEL NOTATION Factorizing a joint distribution The joint probability of random variables X_1, ..., X_n can always be factorized as L(X_1, ..., X_n) = L(X_n | X_1, ..., X_{n−1}) L(X_{n−1} | X_1, ..., X_{n−2}) ⋯ L(X_1). Note that we can re-arrange the variables in any order. If there are conditional independencies, we can remove some variables from the conditionals: L(X_1, ..., X_n) = L(X_n | X'_n) L(X_{n−1} | X'_{n−1}) ⋯ L(X_1), where X'_i is the subset of X_1, ..., X_n on which X_i depends. Definition Let X_1, ..., X_n be random variables. A (directed) graphical model represents a factorization of the joint distribution L(X_1, ..., X_n) as follows: Factorize L(X_1, ..., X_n). Add one vertex for each variable X_i. For each variable X_i, add an edge from each variable X_j ∈ X'_i to X_i. That is: An edge X_j → X_i is added if L(X_1, ..., X_n) contains a factor L(X_i | ...) that conditions on X_j. Advanced Machine Learning 49 / 212

51 GRAPHICAL MODEL NOTATION Lack of uniqueness The factorization is usually not unique, since e.g. L(X, Y) = L(X | Y)L(Y) = L(Y | X)L(X). That means the direction of edges is not generally determined. Remark If we use a graphical model to define a model or visualize a model, we decide on the direction of the edges. Estimating the direction of edges from data is a very difficult (and very important) problem. This is one of the main subjects of a research field called causal inference or causality. Advanced Machine Learning 50 / 212

52 A simple example [Figure: a graph on X, Y, Z encoding X ⊥⊥ Y | Z.] An example with layers [Figure: two layers of vertices, Layer 1 and Layer 2, with edges from the first layer to the second.] All variables in the (k + 1)st layer are conditionally independent given the variables in the kth layer. Advanced Machine Learning 51 / 212

53 WORDS OF CAUTION I [Figure: the same graph on X, Y, Z, encoding X ⊥⊥ Y | Z.] Important X and Y are not independent; independence holds only conditionally on Z. In other words: If we do not observe Z, X and Y are dependent, and we have to change the graph to X → Y or X ← Y. Advanced Machine Learning 52 / 212

54 WORDS OF CAUTION II [Figure: the graph X → Z ← Y.] Conditioning on Z makes X and Y dependent. Example Suppose we start with two independent normal variables X and Y, and set Z = X + Y. If we know Z, and someone reveals the value of Y to us, we know everything about X. This effect is known as explaining away. We will revisit it later. Advanced Machine Learning 53 / 212

55 MACHINE LEARNING EXAMPLES I Hidden Markov models [Figure: the chain Z_1 → Z_2 → ⋯ → Z_{n−1} → Z_n, with an emission edge Z_i → X_i for each i.] Advanced Machine Learning 54 / 212

56 MACHINE LEARNING EXAMPLES II Sigmoid belief networks A graphical model in which each node is a binary random variable and each conditional probability is a logistic regression model is called a sigmoid belief network. Terminology: Belief network or Bayes net are alternative names for graphical models. Deep belief networks A deep belief network is a layered directed graphical model: an input layer Θ_1, Θ_2, ..., Θ_d feeding (possibly through further layers) a data layer X_1, X_2, ..., X_d. Note: Variables in each layer are conditionally independent given the previous layer. Two tasks for deep belief nets are: Sampling: Draw X_{1:d} from L(X_{1:d} | Θ_{1:d}). Inference: Estimate L(Θ_{1:d} | X_{1:d} = x_{1:d}) when data x_{1:d} is observed. Problem: Conditioning a layer on the following one makes variables dependent. (More details later.) Advanced Machine Learning 55 / 212

57 MARKOV RANDOM FIELDS

58 UNDIRECTED GRAPHICAL MODEL A graphical model is undirected when its dependency graph is undirected; equivalently, if each edge in the graph is either absent, or present in both directions. [Figure: graphs on X, Y, Z, directed and undirected.] An undirected graphical model is more commonly known as a Markov random field. Markov random fields are special cases of (directed) graphical models, but have distinct properties. We treat them separately. We will consider the undirected case first. Advanced Machine Learning 57 / 212

59 OVERVIEW We start with an undirected graph: [Figure: a grid of vertices Θ_i connected by edges with weights w_ij.] A random variable Θ_i is associated with each vertex. Two random variables interact if they are neighbors in the graph. Advanced Machine Learning 58 / 212

60 NEIGHBORHOOD GRAPH We define a neighborhood graph, which is a weighted, undirected graph N = (V_N, W_N) with vertex set V_N and set of edge weights W_N. The vertices v_i ∈ V_N are often referred to as sites. The edge weights are scalars w_ij ∈ R. Since the graph is undirected, the weights are symmetric (w_ij = w_ji). An edge weight w_ij = 0 means "no edge between v_i and v_j". Neighborhoods The set of all neighbors of v_i in the graph, ∂(i) := { j | w_ij ≠ 0 }, is called the neighborhood of v_i. [Figure: a vertex with its neighborhood ∂(i) marked in purple.] Advanced Machine Learning 59 / 212

61 MARKOV RANDOM FIELDS Given a neighborhood graph N, associate with each site v_i ∈ V_N a random variable Θ_i. The Markov property We say that the joint distribution P of (Θ_1, ..., Θ_n) satisfies the Markov property with respect to N if L(Θ_i | Θ_j, j ≠ i) = L(Θ_i | Θ_j, j ∈ ∂(i)). The set {Θ_j, j ∈ ∂(i)} of random variables indexed by neighbors of v_i is called the Markov blanket of Θ_i. In words The Markov property says that each Θ_i is conditionally independent of the remaining variables given its Markov blanket. Definition A distribution L(Θ_1, ..., Θ_n) which satisfies the Markov property for a given graph N is called a Markov random field. [Figure: a site Θ_i and its Markov blanket.] Advanced Machine Learning 60 / 212

62 ENERGY FUNCTIONS Probabilities and energies A (strictly positive) density p(x) can always be written in the form p(x) = (1/Z) exp(−H(x)), where H : X → R_+ and Z is a normalization constant. The function H is called an energy function, or cost function, or a potential. MRF energy In particular, we can write an MRF density for random variables Θ_{1:n} as p(θ_1, ..., θ_n) = (1/Z) exp(−H(θ_1, ..., θ_n)). Advanced Machine Learning 61 / 212

63 CLIQUES Graphical models factorize over the graph. How does that work for MRFs? A clique in an undirected graph is a fully connected subgraph. [Figure: examples of a clique and of a subgraph that is not a clique.] The cliques in this graph are: i) The triangles (1, 2, 3), (1, 3, 4). ii) Each pair of vertices connected by an edge (e.g. (2, 6)). Advanced Machine Learning 62 / 212

64 THE HAMMERSLEY-CLIFFORD THEOREM Theorem Let N be a neighborhood graph with vertex set V_N. Suppose the random variables {Θ_i, i ∈ V_N} take values in T, and their joint distribution has probability mass function P, so there is an energy function H such that P(θ_1, ..., θ_n) = e^{−H(θ_1, ..., θ_n)} / ∑_{θ'_{1:n} ∈ T^n} e^{−H(θ'_1, ..., θ'_n)}. The joint distribution has the Markov property if and only if H(θ_1, θ_2, ...) = ∑_{C ∈ 𝒞} H_C(θ_i, i ∈ C), where 𝒞 is the set of cliques in N, and each H_C is a non-negative function with |C| arguments. Hence, P(θ_1, ..., θ_n) = ∏_{C ∈ 𝒞} e^{−H_C(θ_i, i ∈ C)} / ∑_{θ'_{1:n} ∈ T^n} ∏_{C ∈ 𝒞} e^{−H_C(θ'_i, i ∈ C)}. Markov random fields factorize over cliques. Advanced Machine Learning 63 / 212

65 USE OF MRFS In general Modeling systems of dependent random variables is one of the hardest problems in probability. MRFs model dependence, but break it down to a limited number of interactions to make the model tractable. MRFs on grids By far the most widely used neighborhood graphs are 2-dimensional grids. MRFs on grids are used in spatial statistics to model spatial interactions between random variables. [Figure: a 2-dimensional grid graph with 4-neighborhoods.] Hammersley-Clifford for grids: The only cliques are the edges! MRFs on grids factorize over edges. Advanced Machine Learning 64 / 212

66 THE POTTS MODEL Definition Suppose N = (V_N, W_N) is a neighborhood graph with n vertices and β > 0 a constant. Then p(θ_{1:n}) := (1 / Z(β, W_N)) exp( β ∑_{i,j} w_ij I{θ_i = θ_j} ) defines a joint distribution of n random variables Θ_1, ..., Θ_n. This distribution is called the Potts model. Interpretation If w_ij > 0: The overall probability increases if θ_i = θ_j. If w_ij < 0: The overall probability decreases if θ_i = θ_j. If w_ij = 0: No interaction between Θ_i and Θ_j. Positive weights encourage smoothness. Advanced Machine Learning 65 / 212

67 EXAMPLE Ising model The simplest choice is w_ij = 1 if (i, j) is an edge: p(θ_{1:n}) = (1 / Z(β)) exp( ∑_{(i,j) is an edge} β I{θ_i = θ_j} ). If N is a d-dimensional grid, this model is called the Ising model. Example [Figure: samples from an Ising model on a grid graph, for increasing β.] Advanced Machine Learning 66 / 212
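
A minimal sketch of the unnormalized Ising density on a 2-d grid; evaluating the normalization Z(β) would require summing over all 2^n configurations, which is exactly what samplers avoid. The grid size and β below are illustrative:

```python
import numpy as np

def ising_unnormalized(theta, beta):
    """Unnormalized Ising probability exp(beta * #{equal neighbor pairs}) on a 2-d grid.

    theta: 2-d array of values in {0, 1}; the edges are the 4-neighborhoods of the grid.
    """
    equal_pairs = (np.sum(theta[:, :-1] == theta[:, 1:])     # horizontal edges
                   + np.sum(theta[:-1, :] == theta[1:, :]))  # vertical edges
    return np.exp(beta * equal_pairs)

theta = np.random.default_rng(0).integers(0, 2, size=(8, 8))
print(ising_unnormalized(theta, beta=0.5))
```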

68 MRFS AS SMOOTHNESS PRIORS We consider a spatial problem with observations X_i. Each i is a location on a grid. Spatial model Suppose we model each X_i by a distribution L(X | Θ_i), i.e. each location i has its own parameter variable Θ_i. This model is Bayesian (the parameter is a random variable). We use an MRF as prior distribution. [Figure: a grid of unobserved parameters Θ_i, each emitting an observed X_i with distribution p(· | θ_i).] We can think of L(X | Θ_i) as an emission probability, similar to an HMM. Spatial smoothing We can define the joint distribution of (Θ_1, ..., Θ_n) as an MRF on the grid graph. For positive weights, the MRF will encourage the model to explain neighbors X_i and X_j by the same parameter value: spatial smoothing. Advanced Machine Learning 67 / 212

69 EXAMPLE: SEGMENTATION OF NOISY IMAGES Mixture model A BMM (Bayesian mixture model) can be used for image segmentation. The BMM prior on the component parameters is a natural conjugate prior q(θ). In the spatial setting, we index the parameter of each X_i separately as θ_i. For K mixture components, θ_{1:n} contains only K different values. The joint BMM prior on θ_{1:n} is q_BMM(θ_{1:n}) = ∏_{i=1}^n q(θ_i). Smoothing term We multiply the BMM prior q_BMM(θ_{1:n}) with an MRF prior q_MRF(θ_{1:n}) = (1 / Z(β)) exp( β ∑_{w_ij ≠ 0} I{θ_i = θ_j} ). This encourages spatial smoothness of the segmentation. [Figures: input image; segmentation without smoothing; segmentation with MRF smoothing.] Advanced Machine Learning 68 / 212

70 SAMPLING AND INFERENCE MRFs pose two main computational problems. Problem 1: Sampling Generate samples from the joint distribution of (Θ 1,..., Θ n). Problem 2: Inference If the MRF is used as a prior, we have to compute or approximate the posterior distribution. Solution MRF distributions on grids are not analytically tractable. The only known exception is the Ising model in 1 dimension. Both sampling and inference are based on Markov chain sampling algorithms. Advanced Machine Learning 69 / 212

71 SAMPLING ALGORITHMS

72 SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples X_1, X_2, ... from a given distribution P or density p. Sampling algorithms can for example be used to approximate expectations: E_p[f(X)] ≈ (1/n) ∑_{i=1}^n f(X_i). Inference in Bayesian models Suppose we work with a Bayesian model whose posterior Q̂_n := L(Θ | X_{1:n}) cannot be computed analytically. We will see that it can still be possible to sample from Q̂_n. Doing so, we obtain samples Θ_1, Θ_2, ... distributed according to Q̂_n. This reduces posterior estimation to a density estimation problem (i.e. estimate Q̂_n from Θ_1, Θ_2, ...). Advanced Machine Learning 71 / 212

73 PREDICTIVE DISTRIBUTIONS Posterior expectations If we are only interested in some statistic of the posterior of the form E_{Q̂_n}[f(Θ)] (e.g. the posterior mean), we can again approximate it by E_{Q̂_n}[f(Θ)] ≈ (1/m) ∑_{i=1}^m f(Θ_i). Example: Predictive distribution The posterior predictive distribution is our best guess of what the next data point x_{n+1} looks like, given the posterior under previous observations. In terms of densities: p(x_{n+1} | x_{1:n}) := ∫_T p(x_{n+1} | θ) Q̂_n(dθ | X_{1:n} = x_{1:n}). This is one of the key quantities of interest in Bayesian statistics. Computation from samples The predictive is a posterior expectation, and can be approximated as a sample average: p(x_{n+1} | x_{1:n}) = E_{Q̂_n}[p(x_{n+1} | Θ)] ≈ (1/m) ∑_{i=1}^m p(x_{n+1} | Θ_i). Advanced Machine Learning 72 / 212
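
In code, the sample approximation of the predictive is a one-line average. A minimal sketch; the Gaussian likelihood and the posterior samples below are purely illustrative placeholders:

```python
import numpy as np

def predictive_density(x_new, posterior_samples, likelihood):
    """Approximate p(x_new | x_1:n) by averaging likelihood(x_new, theta) over posterior samples."""
    return np.mean([likelihood(x_new, theta) for theta in posterior_samples])

# Illustrative use: Gaussian likelihood with known unit variance, sampled posterior means.
likelihood = lambda x, theta: np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)
theta_samples = np.random.default_rng(0).normal(loc=1.0, scale=0.2, size=1000)
print(predictive_density(0.5, theta_samples, likelihood))
```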

74 BASIC SAMPLING: AREA UNDER CURVE Say we are interested in a probability density p on the interval [a, b]. [Figure: the area A under the curve p on [a, b], with a sample point (X_i, Y_i).] Key observation Suppose we can define a uniform distribution U_A on the blue area A under the curve. If we sample (X_1, Y_1), (X_2, Y_2), ... iid from U_A and discard the vertical coordinates Y_i, the X_i are distributed according to p: X_1, X_2, ... ~iid p. Problem: Defining a uniform distribution is easy on a rectangular area, but difficult on an arbitrarily shaped one. Advanced Machine Learning 73 / 212

75 REJECTION SAMPLING ON THE INTERVAL Solution: Rejection sampling We can enclose p in a box B = [a, b] × [0, c] and sample uniformly from the box. [Figure: the density p enclosed in the box B of height c.] We can sample (X_i, Y_i) uniformly on B by sampling X_i ~ Uniform[a, b] and Y_i ~ Uniform[0, c]. If (X_i, Y_i) ∈ A, keep the sample. That is: If Y_i ≤ p(X_i). Otherwise: Discard it ("reject" it). Result: The remaining (non-rejected) samples are uniformly distributed on A. Advanced Machine Learning 74 / 212

76 SCALING This strategy still works if we scale p vertically by some constant k > 0. [Figure: the box construction for p and for the scaled density k·p.] Consequence We simply draw Y_i ~ Uniform[0, kc] instead of Y_i ~ Uniform[0, c]. For sampling, it is sufficient if p is known only up to normalization (only the shape of p is known). Advanced Machine Learning 75 / 212

77 DISTRIBUTIONS KNOWN UP TO SCALING Sampling methods usually assume that we can evaluate the target distribution p up to a constant. That is: p(x) = (1/Z) p̃(x), and we can compute p̃(x) for any given x, but we do not know Z. We have to pause for a moment and convince ourselves that there are useful examples where this assumption holds. Example 1: Simple posterior For an arbitrary posterior computed with Bayes' theorem, we could write Π(θ | x_{1:n}) = ( ∏_{i=1}^n p(x_i | θ) q(θ) ) / Z with Z = ∫_T ∏_{i=1}^n p(x_i | θ) q(θ) dθ. Provided that we can compute the numerator, we can sample without computing the normalization integral Z. Advanced Machine Learning 76 / 212

78 DISTRIBUTIONS KNOWN UP TO SCALING Example 2: Bayesian Mixture Model Recall that the posterior of the BMM is (up to normalization): q̂_n(c_{1:K}, θ_{1:K} | x_{1:n}) ∝ ∏_{i=1}^n ( ∑_{k=1}^K c_k p(x_i | θ_k) ) ( ∏_{k=1}^K q(θ_k | λ, y) ) q_Dirichlet(c_{1:K}). We already know that we can discard the normalization constant, but can we evaluate the non-normalized posterior q̃_n? The problem with computing q̃_n (as a function of unknowns) is that the term ∏_{i=1}^n ( ∑_{k=1}^K ... ) blows up into K^n individual terms. If we evaluate q̃_n for specific values of c, x and θ, ∑_{k=1}^K c_k p(x_i | θ_k) collapses to a single number for each x_i, and we just have to multiply those n numbers. So: Computing q̃_n as a formula in terms of unknowns is difficult; evaluating it for specific values of the arguments is easy. Advanced Machine Learning 77 / 212

79 DISTRIBUTIONS KNOWN UP TO SCALING Example 3: Markov random field In an MRF, the normalization function is the real problem. For example, recall the Ising model: p(θ_{1:n}) = (1 / Z(β)) exp( ∑_{(i,j) is an edge} β I{θ_i = θ_j} ). The normalization function is Z(β) = ∑_{θ_{1:n} ∈ {0,1}^n} exp( ∑_{(i,j) is an edge} β I{θ_i = θ_j} ) and hence a sum over 2^n terms. The general Potts model is even more difficult. On the other hand, evaluating p̃(θ_{1:n}) = exp( ∑_{(i,j) is an edge} β I{θ_i = θ_j} ) for a given configuration θ_{1:n} is straightforward. Advanced Machine Learning 78 / 212

80 REJECTION SAMPLING ON R^d If we are not on an interval, sampling uniformly from an enclosing box is not possible (since there is no uniform distribution on all of R or R^d). Solution: Proposal density Instead of a box, we use another distribution r to enclose p: [Figure: the target p enclosed by the proposal density r; B is the area under r.] To generate B under r, we apply similar logic as before, backwards: Sample X_i ~ r. Sample Y_i | X_i ~ Uniform[0, r(X_i)]. r is always a simple distribution which we can sample and evaluate. Advanced Machine Learning 79 / 212

81 REJECTION SAMPLING ON R^d [Figure: the (scaled) target p̃ lying under the proposal density r.] Choose a simple distribution r from which we know how to sample. Scale p̃ such that p̃(x) < r(x) everywhere. Sampling: For i = 1, 2, ...: 1. Sample X_i ~ r. 2. Sample Y_i | X_i ~ Uniform[0, r(X_i)]. 3. If Y_i < p̃(X_i), keep X_i. 4. Else, discard X_i and start again at (1). The surviving samples X_1, X_2, ... are distributed according to p. Advanced Machine Learning 80 / 212
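
A minimal sketch of this rejection sampler. The target p̃, the Gaussian proposal r, and the scaling constant k (which must satisfy p̃(x) ≤ k·r(x) everywhere) are illustrative choices, not part of the course material:

```python
import numpy as np

def rejection_sampler(p_tilde, r_sample, r_density, k, n_samples, rng):
    """Draw n_samples from the density proportional to p_tilde, assuming p_tilde(x) <= k * r_density(x)."""
    samples = []
    while len(samples) < n_samples:
        x = r_sample(rng)                        # step 1: propose X ~ r
        y = rng.uniform(0.0, k * r_density(x))   # step 2: Y | X ~ Uniform[0, k r(X)]
        if y < p_tilde(x):                       # step 3: keep X if the point falls under p_tilde
            samples.append(x)
    return np.array(samples)

# Illustrative target: an unnormalized density on R, enclosed by a scaled standard Gaussian.
p_tilde = lambda x: np.exp(-x**2 / 2) * (1 + np.sin(3 * x) ** 2)     # <= 2 * exp(-x^2 / 2)
r_density = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)         # standard Gaussian density
r_sample = lambda rng: rng.normal()
samples = rejection_sampler(p_tilde, r_sample, r_density, k=2 * np.sqrt(2 * np.pi),
                            n_samples=1000, rng=np.random.default_rng(0))
```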

82 FACTORIZATION PERSPECTIVE The rejection step can be interpreted in terms of probabilities and densities. Factorization We factorize the target distribution or density p as p(x) ∝ r(x) A(x), where r is a distribution from which we know how to sample, and A is a probability function we can evaluate once a specific value of x is given. Sampling from the factorization Sample X' ~ r and Z | X' ~ Bernoulli(A(X')); keep X := X' whenever Z = 1. Sampling Bernoulli variables with uniform variables Z | X' ~ Bernoulli(A(X')) can be generated as Z = I{U < A(X')} where U ~ Uniform[0, 1]. Advanced Machine Learning 81 / 212

83 INDEPENDENCE If we draw proposal samples X_i i.i.d. from r, the resulting sequence of accepted samples produced by rejection sampling is again i.i.d. with distribution p. Hence: Rejection samplers produce i.i.d. sequences of samples. Important consequence If samples X_1, X_2, ... are drawn by a rejection sampler, the sample average (1/m) ∑_{i=1}^m f(X_i) (for some function f) is an unbiased estimate of the expectation E_p[f(X)]. Advanced Machine Learning 82 / 212

84 EFFICIENCY The fraction of accepted samples is the ratio |A| / |B| of the areas under the curves p̃ and r. [Figure: the target p̃ under the proposal r, with the areas A and B.] If r is not a reasonably close approximation of p, we will end up rejecting a lot of proposal samples. Advanced Machine Learning 83 / 212

85 AN IMPORTANT BIT OF IMPRECISE INTUITION Example figures for sampling methods tend to look like this. A high-dimensional distribution of correlated RVs will look rather more like this. Sampling is usually used in multiple dimensions. Reason, roughly speaking: Intractable posterior distributions arise when there are several interacting random variables. The interactions make the joint distribution complicated. In one-dimensional problems (1 RV), we can usually compute the posterior analytically. Independent multi-dimensional distributions factorize and reduce to one-dimensional case. Warning: Avoid sampling if you can solve analytically. Advanced Machine Learning 84 / 212

86 WHY IS NOT EVERY SAMPLER A REJECTION SAMPLER? We can easily end up in situations where we accept only one in 10^6 (or 10^10, or 10^20, ...) proposal samples. Especially in higher dimensions, we have to expect this to be not the exception but the rule. Advanced Machine Learning 85 / 212

87 IMPORTANCE SAMPLING The rejection problem can be fixed easily if we are only interested in approximating an expectation E_p[f(X)]. Simple case: We can evaluate p Suppose p is the target density and q a proposal density. An expectation under p can be rewritten as E_p[f(X)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx = E_q[ f(X) p(X)/q(X) ]. Importance sampling We can sample X_1, X_2, ... from q and approximate E_p[f(X)] as E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X_i) p(X_i)/q(X_i). There is no rejection step; all samples are used. This method is called importance sampling. The coefficients p(X_i)/q(X_i) are called importance weights. Advanced Machine Learning 86 / 212

88 IMPORTANCE SAMPLING General case: We can only evaluate p̃ In the general case, p = (1/Z_p) p̃ and q = (1/Z_q) q̃, and Z_p (and possibly Z_q) are unknown. We can write Z_p/Z_q as Z_p/Z_q = ∫ p̃(x) dx / Z_q = ∫ (p̃(x)/q̃(x)) q̃(x)/Z_q dx = ∫ (p̃(x)/q̃(x)) q(x) dx = E_q[ p̃(X)/q̃(X) ]. Approximating the constants The fraction Z_p/Z_q can be approximated using samples x_{1:m} from q: Z_p/Z_q = E_q[ p̃(X)/q̃(X) ] ≈ (1/m) ∑_{i=1}^m p̃(x_i)/q̃(x_i). Approximating E_p[f(X)] E_p[f(X)] ≈ (1/m) ∑_{i=1}^m f(X_i) p(X_i)/q(X_i) = (1/m) ∑_{i=1}^m f(X_i) (Z_q/Z_p) p̃(X_i)/q̃(X_i) = ∑_{i=1}^m f(X_i) (p̃(X_i)/q̃(X_i)) / ∑_{j=1}^m (p̃(X_j)/q̃(X_j)). Advanced Machine Learning 87 / 212

89 IMPORTANCE SAMPLING IN GENERAL Conditions Given are a target distribution p and a proposal distribution q, with p = (1/Z_p) p̃ and q = (1/Z_q) q̃. We can evaluate p̃ and q̃, and we can sample from q. The objective is to compute E_p[f(X)] for a given function f. Algorithm 1. Sample X_1, ..., X_m from q. 2. Approximate E_p[f(X)] as E_p[f(X)] ≈ ∑_{i=1}^m f(X_i) (p̃(X_i)/q̃(X_i)) / ∑_{j=1}^m (p̃(X_j)/q̃(X_j)). Advanced Machine Learning 88 / 212
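
A minimal sketch of this self-normalized importance sampling estimator. The target p̃ (known up to scaling), the Gaussian proposal q, and the test function f are illustrative:

```python
import numpy as np

def importance_estimate(f, p_tilde, q_sample, q_density, m, rng):
    """Approximate E_p[f(X)] using m samples from q and self-normalized importance weights."""
    x = q_sample(rng, m)                    # X_1, ..., X_m ~ q
    w = p_tilde(x) / q_density(x)           # unnormalized weights p_tilde(x_i) / q(x_i)
    return np.sum(w * f(x)) / np.sum(w)     # weighted average; the normalization of p cancels

# Illustrative example: E[X^2] under an unnormalized standard Gaussian, proposal N(0, 2^2).
p_tilde = lambda x: np.exp(-x**2 / 2)
q_density = lambda x: np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)
q_sample = lambda rng, m: rng.normal(scale=2.0, size=m)
print(importance_estimate(lambda x: x**2, p_tilde, q_sample, q_density,
                          m=100000, rng=np.random.default_rng(0)))  # should be close to 1.0
```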

90 MARKOV CHAIN MONTE CARLO

91 MOTIVATION Suppose we rejection-sample a distribution like this: region of interest Once we have drawn a sample in the narrow region of interest, we would like to continue drawing samples within the same region. That is only possible if each sample depends on the location of the previous sample. Proposals in rejection sampling are i.i.d. Hence, once we have found the region where p concentrates, we forget about it for the next sample. Advanced Machine Learning 90 / 212

92 MCMC: IDEA Recall: Markov chain A sufficiently nice Markov chain (MC) has an invariant distribution P_inv. Once the MC has converged to P_inv, each sample X_i from the chain has marginal distribution P_inv. Markov chain Monte Carlo We want to sample from a distribution with density p. Suppose we can define a MC with invariant distribution P_inv ≡ p. If we sample X_1, X_2, ... from the chain, then once it has converged, we obtain samples X_i ~ p. This sampling technique is called Markov chain Monte Carlo (MCMC). Note: For a Markov chain, X_{i+1} can depend on X_i, so at least in principle, it is possible for an MCMC sampler to "remember" the previous step and remain in a high-probability location. Advanced Machine Learning 91 / 212

93 CONTINUOUS MARKOV CHAIN The Markov chains we discussed so far had a finite state space X. For MCMC, the state space now has to be the domain of p, so we often need to work with continuous state spaces. Continuous Markov chain A continuous Markov chain is defined by an initial distribution P_init and a conditional probability t(y | x), the transition probability or transition kernel. In the discrete case, t(y = i | x = j) is the entry p_ij of the transition matrix p. Example: A Markov chain on R^2 We can define a very simple Markov chain by sampling X_{i+1} | X_i = x_i ~ g(· | x_i, σ^2), where g(x | µ, σ^2) is a spherical Gaussian with fixed variance. In other words, the transition distribution is t(x_{i+1} | x_i) := g(x_{i+1} | x_i, σ^2). [Figure: a Gaussian (gray contours) is placed around the current point x_i to sample X_{i+1}.] Advanced Machine Learning 92 / 212

94 INVARIANT DISTRIBUTION Recall: Finite case The invariant distribution P_inv is a distribution on the finite state space X of the MC (i.e. a vector of length |X|). "Invariant" means that, if X_i is distributed according to P_inv, and we execute a step X_{i+1} ~ t(· | x_i) of the chain, then X_{i+1} again has distribution P_inv. In terms of the transition matrix p: p P_inv = P_inv. Continuous case X is now uncountable (e.g. X = R^d). The transition matrix p is substituted by the conditional probability t. A distribution P_inv with density p_inv is invariant if ∫_X t(y | x) p_inv(x) dx = p_inv(y). This is simply the continuous analogue of the equation ∑_i p_ij (P_inv)_i = (P_inv)_j. Advanced Machine Learning 93 / 212

95 MARKOV CHAIN SAMPLING We run the Markov chain for n steps. Each step moves from the current location x_i to a new x_{i+1}. We "forget" the order and regard the locations x_{1:n} as a random set of points. If p (red contours) is both the invariant and initial distribution, each X_i is distributed as X_i ~ p. Problems we need to solve 1. We have to construct a MC with invariant distribution p. 2. We cannot actually start sampling with X_1 ~ p; if we knew how to sample from p, all of this would be pointless. 3. Each point X_i is marginally distributed as X_i ~ p, but the points are not i.i.d. Advanced Machine Learning 94 / 212

96 CONSTRUCTING THE MARKOV CHAIN Given is a continuous target distribution with density p. Metropolis-Hastings (MH) kernel 1. We start by defining a conditional probability q(y | x) on X. q has nothing to do with p; we could e.g. choose q(y | x) = g(y | x, σ^2), as in the previous example. 2. We define a rejection kernel A as A(x_{i+1} | x_i) := min{ 1, ( q(x_i | x_{i+1}) p(x_{i+1}) ) / ( q(x_{i+1} | x_i) p(x_i) ) }. The normalization of p cancels in the quotient, so knowing p̃ is again enough. 3. We define the transition probability of the chain as t(x_{i+1} | x_i) := q(x_{i+1} | x_i) A(x_{i+1} | x_i) + δ_{x_i}(x_{i+1}) c(x_i), where c(x_i) := ∫ q(y | x_i)(1 − A(y | x_i)) dy is the total probability that a proposal is sampled and then rejected. Sampling from the MH chain At each step i + 1, generate a proposal X* ~ q(· | x_i) and U_i ~ Uniform[0, 1]. If U_i ≤ A(x* | x_i), accept the proposal: Set x_{i+1} := x*. If U_i > A(x* | x_i), reject the proposal: Set x_{i+1} := x_i. Advanced Machine Learning 95 / 212
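
A minimal sketch of a Metropolis-Hastings sampler with a Gaussian random-walk proposal; the unnormalized target, step size, and chain length are illustrative choices. With a symmetric proposal the q-terms in the acceptance ratio cancel, which the code notes in a comment:

```python
import numpy as np

def metropolis_hastings(p_tilde, x0, n_steps, sigma, rng):
    """Random-walk Metropolis-Hastings targeting the density proportional to p_tilde."""
    x = np.array(x0, dtype=float)
    chain = []
    for _ in range(n_steps):
        proposal = x + sigma * rng.normal(size=x.shape)      # X* ~ q(. | x) = N(x, sigma^2 I)
        # Acceptance probability A = min{1, q(x | x*) p~(x*) / (q(x* | x) p~(x))};
        # the Gaussian random walk is symmetric, so the q-terms cancel here.
        a = min(1.0, p_tilde(proposal) / p_tilde(x))
        if rng.uniform() < a:
            x = proposal                                      # accept
        chain.append(x.copy())                                # on rejection, x_{i+1} = x_i
    return np.array(chain)

# Illustrative target: unnormalized mixture of two Gaussians on R^2.
p_tilde = lambda z: (np.exp(-0.5 * np.sum((z - 2) ** 2))
                     + np.exp(-0.5 * np.sum((z + 2) ** 2)))
chain = metropolis_hastings(p_tilde, x0=[0.0, 0.0], n_steps=5000,
                            sigma=1.0, rng=np.random.default_rng(0))
```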

97 PROBLEM 1: INITIAL DISTRIBUTION Recall: Fundamental theorem on Markov chains Suppose we sample X_1 ~ P_init and X_{i+1} ~ t(· | x_i). This defines a distribution P_i of X_i, which can change from step to step. If the MC is nice (recall: recurrent and aperiodic), then P_i → P_inv as i → ∞. Note: Making precise what aperiodic means in a continuous state space is a bit more technical than in the finite case, but the theorem still holds. We will not worry about the details here. Implication If we can show that P_inv ≡ p, we do not have to know how to sample from p. Instead, we can start with any P_init, and will get arbitrarily close to p for sufficiently large i. Advanced Machine Learning 96 / 212

98 BURN-IN AND MIXING TIME The number m of steps required until P_m ≈ P_inv ≡ p is called the mixing time of the Markov chain. (In probability theory, there is a range of definitions for what exactly P_m ≈ P_inv means.) In MC samplers, the first m samples are also called the burn-in phase. The first m samples of each run of the sampler are discarded: X_1, ..., X_m (burn-in; discard), X_{m+1}, ... (samples from (approximately) p; keep). Convergence diagnostics In practice, we do not know how large m is. There are a number of methods for assessing whether the sampler has mixed. Such heuristics are often referred to as convergence diagnostics. Advanced Machine Learning 97 / 212

99 PROBLEM 2: SEQUENTIAL DEPENDENCE Even after burn-in, the samples from a MC are not i.i.d. Strategy Estimate empirically how many steps L are needed for x_i and x_{i+L} to be approximately independent. The number L is called the lag. After burn-in, keep only every Lth sample; discard samples in between. Estimating the lag The most common method uses the autocorrelation function: Auto(x_i, x_j) := E[(x_i − µ_i)(x_j − µ_j)] / (σ_i σ_j). We compute Auto(x_i, x_{i+L}) empirically from the sample for different values of L, and find the smallest L for which the autocorrelation is close to zero. [Figure: the autocorrelation Auto(x_i, x_{i+L}) decaying as a function of L.] Advanced Machine Learning 98 / 212
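
A sketch of the empirical autocorrelation used to pick the lag L, assuming a one-dimensional chain; the threshold and maximum lag are illustrative tuning choices:

```python
import numpy as np

def autocorrelation(x, lag):
    """Empirical autocorrelation Auto(x_i, x_{i+L}) of a one-dimensional chain x."""
    x = np.asarray(x, dtype=float)
    a, b = x[:-lag], x[lag:]
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())

def choose_lag(x, threshold=0.1, max_lag=200):
    """Smallest lag at which the absolute autocorrelation drops below the threshold."""
    for lag in range(1, max_lag):
        if abs(autocorrelation(x, lag)) < threshold:
            return lag
    return max_lag
```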

100 CONVERGENCE DIAGNOSTICS There are about half a dozen popular convergence criteria; the one below is an example. Gelman-Rubin criterion Start several chains at random. For each chain k, the sample X_i^k has a marginal distribution P_i^k. The distributions P_i^k will differ between chains in early stages. Once the chains have converged, all P_i^k = P_inv are identical. Criterion: Use a hypothesis test to compare P_i^k for different k (e.g. compare P_i^2 against the null hypothesis P_i^1). Once the test does not reject anymore, assume that the chains are past burn-in. Reference: A. Gelman and D. B. Rubin: "Inference from Iterative Simulation Using Multiple Sequences", Statistical Science, Vol. 7 (1992). Advanced Machine Learning 99 / 212

101 STOCHASTIC HILL-CLIMBING The Metropolis-Hastings rejection kernel was defined as: A(x_{i+1} | x_i) = min{ 1, ( q(x_i | x_{i+1}) p(x_{i+1}) ) / ( q(x_{i+1} | x_i) p(x_i) ) }. Hence, we certainly accept if the second term is larger than 1, i.e. if q(x_i | x_{i+1}) p(x_{i+1}) > q(x_{i+1} | x_i) p(x_i). That means: We always accept the proposal value x_{i+1} if it increases the probability under p. If it decreases the probability, we still accept with a probability which depends on the difference to the current probability. Hill-climbing interpretation The MH sampler somewhat resembles a gradient ascent algorithm on p, which tends to move in the direction of increasing probability p. However: The actual steps are chosen at random. The sampler can move "downhill" with a certain probability. When it reaches a local maximum, it does not get stuck there. Advanced Machine Learning 100 / 212

102 SELECTING A PROPOSAL DISTRIBUTION Everyone s favorite example: Two Gaussians red = target distribution p gray = proposal distribution q Var[q] too large: Will overstep p; many rejections. Var[q] too small: Many steps needed to achieve good coverage of domain. If p is unimodal and can be roughly approximated by a Gaussian, Var[q] should be chosen as smallest covariance component of p. More generally For complicated posteriors (recall: small regions of concentration, large low-probability regions in between) choosing q is much more difficult. To choose q with good performance, we already need to know something about the posterior. There are many strategies, e.g. mixture proposals (with one component for large steps and one for small steps). Advanced Machine Learning 101 / 212

103 SUMMARY: MH SAMPLER MCMC samplers construct a MC with invariant distribution p. The MH kernel is one generic way to construct such a chain from p and a proposal distribution q. Formally, q does not depend on p (but an arbitrary choice of q usually means bad performance). We have to discard an initial number m of samples as burn-in to obtain samples (approximately) distributed according to p. After burn-in, we keep only every Lth sample (where L = lag) to make sure the x_i are (approximately) independent: X_1, ..., X_{m−1} (burn-in; discard); X_m (keep); X_{m+1}, ..., X_{m+L−1} (correlated with X_m; discard); X_{m+L} (keep); X_{m+L+1}, ..., X_{m+2L−1} (correlated with X_{m+L}; discard); X_{m+2L} (keep); ... Advanced Machine Learning 102 / 212

104 THE GIBBS SAMPLER

105 GIBBS SAMPLING By far the most widely used MCMC algorithm is the Gibbs sampler. Full conditionals Suppose L(X) is a distribution on R^D, so X = (X_1, ..., X_D). The conditional distribution of the entry X_d given all other entries, L(X_d | X_1, ..., X_{d−1}, X_{d+1}, ..., X_D), is called the full conditional distribution of X_d. On R^D, that means we are interested in a density p(x_d | x_1, ..., x_{d−1}, x_{d+1}, ..., x_D). Gibbs sampling The Gibbs sampler is the special case of the Metropolis-Hastings algorithm defined by: proposal distribution for X_d = full conditional of X_d. Gibbs sampling is only applicable if we can compute the full conditionals for each dimension d. If so, it provides us with a generic way to derive a proposal distribution. Advanced Machine Learning 104 / 212

106 THE GIBBS SAMPLER Proposal distribution Suppose p is a distribution on R^D, so each sample is of the form X_i = (X_{i,1}, ..., X_{i,D}). We generate a proposal X_{i+1} coordinate-by-coordinate as follows: X_{i+1,1} ~ p(· | x_{i,2}, ..., x_{i,D}), ..., X_{i+1,d} ~ p(· | x_{i+1,1}, ..., x_{i+1,d−1}, x_{i,d+1}, ..., x_{i,D}), ..., X_{i+1,D} ~ p(· | x_{i+1,1}, ..., x_{i+1,D−1}). Note: Each new X_{i+1,d} is immediately used in the update of the next dimension d + 1. A Metropolis-Hastings algorithm with proposals generated as above is called a Gibbs sampler. No rejections It is straightforward to show that the Metropolis-Hastings acceptance probability for each coordinate update x_{i+1,d} is 1, so proposals in Gibbs sampling are always accepted. Advanced Machine Learning 105 / 212

107 EXAMPLE: MRF In an MRF with D nodes, each dimension d corresponds to one vertex. Full conditionals In a grid with 4-neighborhoods, for instance, the Markov property implies that p(θ_d | θ_1, ..., θ_{d−1}, θ_{d+1}, ..., θ_D) = p(θ_d | θ_left, θ_right, θ_up, θ_down). [Figure: a site Θ_d with its four neighbors Θ_left, Θ_right, Θ_up, Θ_down.] Specifically: Potts model with binary weights Recall that, for sampling, knowing only p̃ (unnormalized) is sufficient: p(θ_d | θ_1, ..., θ_{d−1}, θ_{d+1}, ..., θ_D) ∝ exp( β( I{θ_d = θ_left} + I{θ_d = θ_right} + I{θ_d = θ_up} + I{θ_d = θ_down} ) ). This is clearly very efficiently computable. Advanced Machine Learning 106 / 212

108 EXAMPLE: MRF Sampling the Potts model Each step of the sampler generates a sample θ_i = (θ_{i,1}, ..., θ_{i,D}), where D is the number of vertices in the grid. Gibbs sampler Each step of the Gibbs sampler generates D updates according to θ_{i+1,d} ~ p(· | θ_{i+1,1}, ..., θ_{i+1,d−1}, θ_{i,d+1}, ..., θ_{i,D}) ∝ exp( β( I{θ_{i+1,d} = θ_left} + I{θ_{i+1,d} = θ_right} + I{θ_{i+1,d} = θ_up} + I{θ_{i+1,d} = θ_down} ) ). Advanced Machine Learning 107 / 212
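
A minimal sketch of one Gibbs sweep for the binary Potts (Ising) model on a 2-d grid with 4-neighborhoods, using the full conditionals above; the grid size, β, and number of sweeps are illustrative:

```python
import numpy as np

def gibbs_sweep(theta, beta, rng):
    """One Gibbs sweep over a binary Potts (Ising) model on a 2-d grid with 4-neighborhoods.

    Full conditional: P(theta_d = v | neighbors) is proportional to exp(beta * #{neighbors equal to v}).
    """
    n_rows, n_cols = theta.shape
    for i in range(n_rows):
        for j in range(n_cols):
            neighbors = []
            if i > 0: neighbors.append(theta[i - 1, j])
            if i < n_rows - 1: neighbors.append(theta[i + 1, j])
            if j > 0: neighbors.append(theta[i, j - 1])
            if j < n_cols - 1: neighbors.append(theta[i, j + 1])
            n_ones = sum(neighbors)
            n_zeros = len(neighbors) - n_ones
            p_one = np.exp(beta * n_ones) / (np.exp(beta * n_ones) + np.exp(beta * n_zeros))
            theta[i, j] = rng.uniform() < p_one    # sample the site from its full conditional
    return theta

rng = np.random.default_rng(0)
theta = rng.integers(0, 2, size=(32, 32))
for _ in range(100):   # remember: after a short run the chain may still be far from P_inv
    gibbs_sweep(theta, beta=1.0, rng=rng)
```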

109 BURN-IN MATTERS This example is due to Erik Sudderth (UC Irvine). MRFs as "segmentation" priors MRFs were introduced as tools for image smoothing and segmentation by D. and S. Geman (1984). They sampled from a Potts model with a Gibbs sampler, discarding 200 iterations as burn-in. Such a sample (after 200 steps) is shown above, for a Potts model in which each variable can take one out of 5 possible values. These patterns led computer vision researchers to conclude that MRFs are "natural" priors for image segmentation, since samples from the MRF resemble a segmented image. Advanced Machine Learning 108 / 212

110 EXAMPLE: BURN-IN MATTERS E. Sudderth ran a Gibbs sampler on the same model and sampled after 200 iterations (as the Geman brothers did), and again after 10,000 iterations: [Figure: samples after 200 iterations and after 10,000 iterations, for chains 1 through 5.] The "segmentation" patterns are not sampled from the MRF distribution p ≡ P_inv, but rather from P_200 ≠ P_inv. The patterns occur not because MRFs are "natural" priors for segmentations, but because the sampler's Markov chain has not mixed. MRFs are smoothness priors, not segmentation priors. Advanced Machine Learning 109 / 212

111 VARIATIONAL INFERENCE

112 VARIATIONAL INFERENCE: IDEA Problem We have to solve an inference problem where the correct solution is an intractable distribution with density p* (e.g. a complicated posterior in a Bayesian inference problem). Variational approach Approximate p* as q* := arg min_{q ∈ Q} φ(q, p*), where Q is a class of simple distributions and φ is a cost function (small φ means good fit). That turns the inference problem into a constrained optimization problem: min φ(q, p*) s.t. q ∈ Q. Variational inference approximates a complicated distribution by minimizing the distance (or discrepancy) to a class of tractable distributions. Advanced Machine Learning 111 / 212

113 BACKGROUND: VARIATIONAL METHODS Recall: Optimization approach to problems Formulate your problem such that the solution x* ∈ R^d is the minimum of some function f, and solve x* := arg min_{x ∈ R^d} f(x), possibly under constraints. Examples: Support vector machines, linear regression, logistic regression, ... Inference problem as above q* := arg min_{q ∈ Q} φ(q, p*) q is now a function (a density), not a point in R^d. We have to optimize over a space of functions. Such spaces are in general infinite-dimensional. Often: Q is a parametric model, with parameter space T ⊂ R^d, and the problem reduces to optimization over R^d. However: Optimization over infinite-dimensional spaces is in principle possible. Advanced Machine Learning 112 / 212

114 OPTIMIZATION OF FUNCTIONALS Let F be a space of functions (e.g. all continuous functions on R). A function φ : F → R (a function whose arguments are functions) is called a functional. Examples: (1) The integral of a function. (2) The differential entropy of a density. Recall: Derivatives (on R^d) The differential of f : R^d → R at a point x is δf(x) = lim_{ε→0} (f(x + ε) − f(x)) / ε if d = 1, or δf(x) = lim_{‖Δx‖→0} (f(x + Δx) − f(x)) / ‖Δx‖. The d-dimensional case works by reducing to the 1-dimensional case using a norm. Derivatives of functionals in general If F is a function space and ‖·‖ a norm on F, we can apply the same idea to φ : F → R: δφ(f) := lim_{‖Δf‖→0} (φ(f + Δf) − φ(f)) / ‖Δf‖. δφ(f) is called the Fréchet derivative of φ at f. f is a minimum of a Fréchet-differentiable functional φ only if δφ(f) = 0. (Not examinable.) Advanced Machine Learning 113 / 212

115 OPTIMIZATION OF FUNCTIONALS Optimization We can in principle find a minimum of φ by gradient descent: Add increment functions Δf_k in the direction of −δφ(f_k) to the current solution candidate f_k. The maximum entropy problem is often cited as an example. Problems: We have to represent the infinite-dimensional quantities f_k and Δf_k in some way. Many interesting functionals φ are not Fréchet-differentiable as functionals on F; they only become differentiable when constrained to a much smaller subspace. One solution is variational calculus, an analytic technique that addresses both problems. (We will not need the details.) Recall: Maximum entropy principle The maximum entropy principle chooses a distribution within some set P of candidates by selecting the one with the largest entropy. That is: It solves the optimization problem max H(p) s.t. p ∈ P. For example, if P are all those distributions under which some given statistic S takes a given expected value, we obtain exponential family distributions with sufficient statistic S. (Not examinable.) Advanced Machine Learning 114 / 212

116 OPTIMIZATION OF FUNCTIONALS Maximum entropy as functional optimization The entropy H assigns a scalar to a distribution: a functional! Problem: The entropy as a functional, e.g. on all distributions on R, is concave, but it is not differentiable; it is not even continuous. The solution for exponential families can be determined using variational calculus. Functional optimization in machine learning We will be interested in problems of the form min_q φ(q) s.t. q ∈ Q, where Q is a parametric family. That means each element of Q is of the form q(· | θ), for θ ∈ T ⊂ R^d. The problem then reduces back to optimization in R^d: min_θ φ(q(· | θ)) s.t. θ ∈ T. We can apply gradient descent, Newton, etc. (Not examinable.) Advanced Machine Learning 115 / 212

117 KULLBACK-LEIBLER DIVERGENCE Recall The information in observing X = x under a probability mass function P is J_P(x) := log(1 / P(x)) = −log P(x). Its expectation H(P) := E_P[J_P(X)] is the entropy of P. The Kullback-Leibler divergence of P and Q is D_KL(P‖Q) := E_P[J_Q(X)] − H(P) = ∑_x P(x) log( P(x) / Q(x) ). Entropy and KL divergence for densities If p and q are probability densities, then H(p) := −∫ p(x) log p(x) dx and D_KL(p‖q) := ∫ p(x) log( p(x) / q(x) ) dx are the differential entropy of p and the Kullback-Leibler divergence of p and q. Be careful The differential entropy does not behave like the entropy (e.g. it can be negative). The KL divergence for densities has properties analogous to the mass function case. Advanced Machine Learning 116 / 212

118 VARIATIONAL INFERENCE WITH KL Recall the VI optimization problem q* := arg min_{q ∈ Q} φ(q, p*). We have to choose a cost function. The term variational inference in machine learning typically implies that φ is a KL divergence: q* := arg min_{q ∈ Q} D_KL(q, p*). Order of the arguments Recall that D_KL is not symmetric, so D_KL(q, p*) ≠ D_KL(p*, q). Which order should we use? Recall D_KL(p*‖q) is an expectation with respect to p*. D_KL(p*‖q) emphasizes regions where the true model p* has high probability. That is what we should use if possible. We use VI because p* is intractable, so we can usually not compute expectations under it. We use the expectation D_KL(q, p*) under the approximating simpler model instead. We have to understand the implications of this choice. Advanced Machine Learning 117 / 212

119 EXAMPLE
Approximating a Gaussian by a spherical Gaussian.
[Figure: left panel, "what VI would do if possible": D_KL(p* ‖ q) = D_KL(green ‖ red); right panel, "what VI does": D_KL(q ‖ p*) = D_KL(red ‖ green).]
Advanced Machine Learning Illustration: Bishop (2006) 118 / 212

120 EXAMPLE
Approximating a Gaussian mixture by a single Gaussian.
[Figure: left panel, "what VI would do if possible": D_KL(p* ‖ q) = D_KL(blue ‖ red); right panel, "what VI does": D_KL(q ‖ p*) = D_KL(red ‖ blue).]
Advanced Machine Learning Illustration: Bishop (2006) 119 / 212
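The same effect can be reproduced numerically. The following sketch (assumed example: a two-component mixture on a grid, scipy for the Gaussian density, brute-force search over (μ, σ)) fits a single Gaussian q to a bimodal p* under both orderings: minimizing D_KL(p* ‖ q) matches the moments of p* and covers both modes, while minimizing D_KL(q ‖ p*), the VI choice, locks onto one mode.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7)   # target p*

def kl(a, b):
    """KL divergence between densities tabulated on the grid x (guards against underflow)."""
    a = np.maximum(a, 1e-300)
    return np.sum(a * np.log(a / b)) * dx

# Minimizing D_KL(p* || q) over Gaussians q matches the mean and variance of p*
# (mass-covering): the fitted Gaussian straddles both modes.
mu_fwd = np.sum(x * p) * dx
var_fwd = np.sum((x - mu_fwd) ** 2 * p) * dx
q_fwd = norm.pdf(x, mu_fwd, np.sqrt(var_fwd))

# Minimizing D_KL(q || p*) (what VI does), here by brute-force grid search,
# tends to lock onto a single mode (mode-seeking).
best = None
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.3, 5, 100):
        q = norm.pdf(x, mu, sigma)
        val = kl(q, p)
        if best is None or val < best[0]:
            best = (val, mu, sigma)

print("forward KL fit:  mu=%.2f sigma=%.2f" % (mu_fwd, np.sqrt(var_fwd)))
print("reverse KL fit:  mu=%.2f sigma=%.2f" % (best[1], best[2]))
```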

121 VI FOR POSTERIOR DISTRIBUTIONS
Often, the target distribution is the posterior of a parameter or latent variable Z, given data x.
Basic approximation problem
If the posterior density is p*(z) = p(z | x) = p(x | z) p(z) / p(x), then
    q*(·) = arg min_{q ∈ Q} D_KL( q(·) ‖ p(· | x) ).
Transforming the objective function
    D_KL( q(·) ‖ p(· | x) ) = E_q[ log( q(Z) / p(Z | x) ) ]
                            = E_q[log q(Z)] − E_q[log p(Z | x)]
                            = E_q[log q(Z)] − E_q[log p(Z, x)] + log p(x)
The evidence p(x) is hard to compute (it is an integral of p(x | z) p(z) over z). But it depends only on x, so it is an additive constant w.r.t. the optimization problem, and dropping it from the objective function does not change the location of the minimum. That leaves
    F(q) := E_q[log q(Z)] − E_q[log p(Z, x)].
Advanced Machine Learning 120 / 212
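Below is a minimal sketch (assumed toy model, not from the slides) in which the evidence is available in closed form: Z ~ N(0, 1) and X | Z ~ N(Z, 1), so p(x) = N(x; 0, 2) and the posterior is N(x/2, 1/2). We estimate F(q) by Monte Carlo for two choices of q and check that −F(q) never exceeds log p(x) and attains it at the exact posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x_obs = 1.3                                   # a single observation (assumed value)

# Model: Z ~ N(0, 1), X | Z ~ N(Z, 1).  Evidence: p(x) = N(x; 0, 2).
def log_joint(z):
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x_obs, z, 1.0)

log_evidence = norm.logpdf(x_obs, 0.0, np.sqrt(2.0))

def free_energy(m, s, n_samples=200_000):
    """Monte Carlo estimate of F(q) = E_q[log q(Z)] - E_q[log p(Z, x)] for q = N(m, s^2)."""
    z = rng.normal(m, s, size=n_samples)
    return np.mean(norm.logpdf(z, m, s) - log_joint(z))

# A deliberately crude q versus the exact posterior N(x/2, 1/2).  At the posterior,
# log q(z) - log p(z, x) = -log p(x) for every z, so the estimate is exact.
print("-F(crude q)   =", -free_energy(0.0, 1.0))
print("-F(posterior) =", -free_energy(x_obs / 2.0, np.sqrt(0.5)))
print("log p(x)      =", log_evidence)
```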

122 VI FOR POSTERIOR DISTRIBUTIONS
Summary: VI approximation
    min F(q)   s.t. q ∈ Q,   where   F(q) = E_q[log q(Z)] − E_q[log p(Z, x)]
Terminology
The function F is called a free energy in statistical physics. Since there are different forms of free energies, various authors attach different adjectives (variational free energy, Helmholtz free energy, etc.).
Parts of the machine learning literature have renamed −F: they maximize the objective function −F and call it an evidence lower bound (ELBO), since
    −F(q) + D_KL( q ‖ p(· | x) ) = log p(x),
hence e^{−F(q)} ≤ p(x), and p(x) is the evidence in the Bayes equation.
Advanced Machine Learning 121 / 212

123 MEAN FIELD APPROXIMATION
Definition
A variational approximation
    q* := arg min_{q ∈ Q} D_KL(q ‖ p*)
of a probability distribution p* on a d-dimensional space is called a mean field approximation if each element of Q is factorial,
    q(z) = q_1(z_1) · · · q_d(z_d).
In the previous example
[Figure: mean field (spherical Gaussian) vs. not a mean field.]
Advanced Machine Learning 122 / 212

124 EXAMPLE: MEAN FIELD FOR THE POTTS MODEL
Model
We consider an MRF distribution for X_1, ..., X_n with values in {−1, +1}, given by
    P(X_1, ..., X_n) = (1 / Z(β)) exp( −β H(X_1, ..., X_n) )
    where   H = −Σ_{i,j=1}^n w_ij X_i X_j − Σ_{i=1}^n h_i X_i.
w_ij and h_i are constant weights; h_i plays the role of an external field. Physicists call this a Potts model with an external magnetic field.
Variational approximation
We choose Q as the family
    Q := { ⊗_{i=1}^n Q_{m_i} | m_i ∈ [−1, 1] }   where   Q_m(X) := (1 + m)/2 if X = +1 and (1 − m)/2 if X = −1.
Each factor is a Bernoulli( (1 + m)/2 ), except that the range is {−1, +1} rather than {0, 1}.
Advanced Machine Learning 123 / 212

125 EXAMPLE: MEAN FIELD FOR THE POTTS MODEL
Optimization problem
    min D_KL( ⊗_{i=1}^n Q_{m_i} ‖ P )   s.t. m_i ∈ [−1, 1] for i = 1, ..., n
Mean field solution
The mean field approximation is given by the parameter values m_i satisfying the equations
    m_i = tanh( β ( Σ_{j=1}^n w_ij m_j + h_i ) ).
That is: for given values of w_ij and h_i, we have to solve this system of equations for the values m_i.
Advanced Machine Learning 124 / 212
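A minimal sketch of solving these fixed-point equations numerically (assumed setup, not prescribed by the slides: a small 2D grid with nearest-neighbour weights w_ij = 1, a weak constant field h_i, and damped fixed-point iteration).

```python
import numpy as np

rng = np.random.default_rng(0)
L, beta = 16, 0.6                              # grid side length, inverse temperature
h = 0.05 * np.ones((L, L))                     # weak external field h_i
m = rng.uniform(-0.1, 0.1, size=(L, L))        # initial magnetizations m_i in [-1, 1]

def neighbour_sum(m):
    """sum_j w_ij m_j for nearest-neighbour weights w_ij = 1 (non-periodic boundary)."""
    s = np.zeros_like(m)
    s[1:, :] += m[:-1, :]; s[:-1, :] += m[1:, :]
    s[:, 1:] += m[:, :-1]; s[:, :-1] += m[:, 1:]
    return s

for _ in range(500):
    m_new = np.tanh(beta * (neighbour_sum(m) + h))   # the mean field equations
    if np.max(np.abs(m_new - m)) < 1e-10:
        m = m_new
        break
    m = 0.5 * m + 0.5 * m_new                  # damping improves convergence

print("average magnetization:", m.mean())
```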

126 EXAMPLE: MEAN FIELD FOR THE POTTS MODEL
We have approximated the MRF P(X_1, ..., X_n) by the product ⊗_{i=1}^n Bernoulli(m_i) satisfying
    m_i = tanh( β ( Σ_{j=1}^n w_ij m_j + h_i ) )
(where we interpret a 0 generated by the Bernoulli as a −1).
Interpretation
In the MRF P, the random variables X_i interact. There is no interaction in the approximation; instead, the effect of interactions is approximated by encoding them in the parameters. This is somewhat like a single effect ("field") acting on all variables simultaneously ("mean field").
How accurate is the approximation?
In physics, P is used to model a ferromagnet in an external magnetic field; in this case, β is the inverse temperature. These systems exhibit a phenomenon called spontaneous magnetization at certain temperatures. The mean field solution predicts spontaneous magnetization, but at the wrong temperature values.
Advanced Machine Learning 125 / 212

127 DIRECTED GRAPHICAL MODELS: MIXTURES AND ADMIXTURES

128 NEXT: BAYESIAN MIXTURES AND ADMIXTURES
We will consider two variations on finite mixtures:
Bayesian mixture models (mixtures with priors).
Admixtures, in which the generation of each observation (e.g. a document) can be influenced by several components (e.g. topics).
One particular admixture model, called latent Dirichlet allocation, is one of the most successful machine learning models of the past ten years.
Advanced Machine Learning 127 / 212

129 FINITE MIXTURE AS A GRAPHICAL MODEL
    π(x) = Σ_{k=1}^K c_k p(x | θ_k)
Sampling from this model
1. Fix c_k and θ_k for k = 1, ..., K.
2. Generate Z_i ~ Multinomial(c_1, ..., c_K).
3. Generate X_i | Z_i ~ p(· | θ_{Z_i}).
As a graphical model
[Figure: directed graphical model with nodes c, Z_1 ... Z_n, θ, X_1 ... X_n; box notation indicates c and θ are not random.]
Advanced Machine Learning 128 / 212
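A direct implementation of these three steps for a toy Gaussian mixture (the weights, means and unit variance are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
c = np.array([0.5, 0.3, 0.2])                  # mixture weights c_1, ..., c_K (fixed)
theta = np.array([-4.0, 0.0, 5.0])             # component parameters (here: means, fixed)
n = 1000

Z = rng.choice(len(c), size=n, p=c)            # step 2: Z_i ~ Multinomial(c)
X = rng.normal(loc=theta[Z], scale=1.0)        # step 3: X_i ~ p(. | theta_{Z_i})

print(np.bincount(Z) / n)                      # empirical component frequencies, close to c
```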

130 PLATE NOTATION
If variables are sampled repeatedly in a graphical model, we enclose these variables in a plate:
[Figure: a node X inside a rectangle (plate) labelled n.]
This means: draw n (conditionally) independent samples of X.
Finite mixture with plate notation
[Figure: the mixture model with nodes c, Z_1 ... Z_n, θ, X_1 ... X_n equals the plate version, with Z and X inside a plate labelled n and c, θ outside.]
Advanced Machine Learning 129 / 212

131 BAYESIAN MIXTURE MODEL
Recall: Mixing distribution of an FMM
    π(x) = Σ_{k=1}^K c_k p(x | θ_k) = ∫_T p(x | θ) m(θ) dθ   with   m := Σ_{k=1}^K c_k δ_{θ_k}
All parameters are summarized in the mixing distribution m.
Bayesian mixture model: Idea
In a Bayesian model, parameters are random variables. Here, that means a random mixing distribution:
    M(·) = Σ_{k=1}^K C_k δ_{Θ_k}(·)
Advanced Machine Learning 130 / 212

132 RANDOM MIXING DISTRIBUTION
How can we define a random distribution?
Since M is discrete with finitely many terms, we only have to generate the random variables C_k and Θ_k:
    M(·) = Σ_{k=1}^K C_k δ_{Θ_k}(·)
More precisely
Specifically, the term BMM implies that all priors are natural conjugate priors. That is:
The mixture components p(x | θ) are an exponential family model.
The prior on each Θ_k is a natural conjugate prior of p.
The prior of the vector (C_1, ..., C_K) is a Dirichlet distribution.
Explanation: Dirichlet distribution
When we sample from a finite mixture, we choose a component k from a multinomial distribution with parameter vector (c_1, ..., c_K). The conjugate prior of the multinomial is the Dirichlet distribution.
Advanced Machine Learning 131 / 212

133 BAYESIAN MIXTURE MODELS
Definition
A model of the form
    π(x) = Σ_{k=1}^K C_k p(x | Θ_k) = ∫_T p(x | θ) M(dθ)
is called a Bayesian mixture model if p(x | θ) is an exponential family model and M is a random mixing distribution, where:
Θ_1, ..., Θ_K ~ iid q(· | λ, y), where q is a natural conjugate prior for p.
(C_1, ..., C_K) is sampled from a K-dimensional Dirichlet distribution.
Advanced Machine Learning 132 / 212

134 BAYESIAN MIXTURE AS A GRAPHICAL MODEL
Sampling from a Bayesian mixture
1. Draw C = (C_1, ..., C_K) from a Dirichlet prior.
2. Draw Θ_1, ..., Θ_K ~ iid q, where q is the conjugate prior of p.
3. Draw Z_i | C ~ Multinomial(C).
4. Draw X_i | Z_i, Θ ~ p(· | Θ_{Z_i}).
As a graphical model
[Figure: plate diagram with nodes C, Z, Θ, X; Z and X are inside a plate labelled n.]
Advanced Machine Learning 133 / 212
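The four steps translate directly into code. This sketch assumes 1D Gaussian components with known unit variance, so the conjugate prior q on each Θ_k is itself a Gaussian; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
K, n = 3, 1000
alpha, g = 2.0, np.ones(K) / K                 # Dirichlet concentration and mean (assumed)

C = rng.dirichlet(alpha * g)                   # 1. mixture weights from the Dirichlet prior
Theta = rng.normal(0.0, 5.0, size=K)           # 2. component means from the conjugate (Gaussian) prior
Z = rng.choice(K, size=n, p=C)                 # 3. cluster assignments
X = rng.normal(Theta[Z], 1.0)                  # 4. observations

print("weights C:   ", np.round(C, 3))
print("means Theta: ", np.round(Theta, 2))
```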

135 BAYESIAN MIXTURE: INFERENCE
Posterior distribution
The posterior density of a BMM given observations x_1, ..., x_n is (up to normalization):
    Π(c_{1:K}, θ_{1:K} | x_{1:n})  ∝  ( ∏_{i=1}^n Σ_{k=1}^K c_k p(x_i | θ_k) ) ( ∏_{k=1}^K q(θ_k | λ, y) ) q_Dirichlet(c_{1:K})
The posterior is analytically intractable
Thanks to conjugacy, we can evaluate each term of the posterior. However: multiplying out the ∏_{i=1}^n Σ_{k=1}^K ... part yields K^n terms. Even for 10 clusters and 100 observations, that is impossible to compute.
Advanced Machine Learning 134 / 212

136 GIBBS SAMPLER FOR THE BMM
This Gibbs sampler is a bit harder to derive, so we skip the derivation and only look at the algorithm.
Recall: Bayesian mixture model
Exponential family likelihood p(x | θ_k) for each cluster k = 1, ..., K.
Natural conjugate prior q for all θ_k.
Dirichlet prior Dirichlet(α, g) on the mixture weights c_{1:K}.
Assignment probabilities
Each step of the Gibbs sampler computes an n × K assignment matrix a = (a_ik), where a_ik = Pr{x_i in cluster k}. Entries are computed as they are in the EM algorithm:
    a_ik = C_k p(x_i | Θ_k) / Σ_{l=1}^K C_l p(x_i | Θ_l)
In contrast to EM, the values C_k and Θ_k are random.
Advanced Machine Learning 135 / 212

137 GIBBS FOR BMM: ALGORITHM
In each iteration j, the algorithm cycles through these steps:
1. For each x_i, sample an assignment
    Z_i^j ~ Multinomial(a_i1^j, ..., a_iK^j)   where   a_ik^j = C_k^{j−1} p(x_i | Θ_k^{j−1}) / Σ_{l=1}^K C_l^{j−1} p(x_i | Θ_l^{j−1})
(exactly as in EM).
2. For each cluster k, sample a new value for Θ_k^j from the conjugate posterior Π(Θ_k | ·) under the observations currently assigned to k:
    Θ_k^j ~ Π( Θ_k | λ + Σ_{i=1}^n I{Z_i^j = k},  y + Σ_{i=1}^n I{Z_i^j = k} S(x_i) )
The first hyperparameter update adds the number of points currently assigned to k; the second aggregates the sufficient statistic S(x_i) over cluster k.
3. Sample new cluster proportions C_{1:K}^j from the Dirichlet posterior (under all x_i):
    C_{1:K}^j ~ Dirichlet(α + n, g_{1:K}^j)   where   g_k^j = ( α g_k + Σ_{i=1}^n I{Z_i^j = k} ) / ( α + n ).
Here α g_k combines the prior concentration α and the prior expectation g_k of C_k, the sum counts the points currently assigned to k, and dividing by α + n normalizes g_{1:K}^j.
Advanced Machine Learning 136 / 212
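A runnable sketch of one version of this sampler, specialized (as an assumed example, not the general exponential-family formulation of the slides) to 1D Gaussian clusters with unit variance and a N(0, τ²) conjugate prior on each mean. The Dirichlet update is written with the standard parameter vector α·g_k + counts, which is the same distribution as Dirichlet(α + n, g^j) in step 3.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data from three well-separated clusters (assumed for illustration).
X = np.concatenate([rng.normal(-5, 1, 150), rng.normal(0, 1, 200), rng.normal(6, 1, 150)])
n, K = len(X), 3
alpha, g = 2.0, np.ones(K) / K                 # Dirichlet prior: concentration and mean
tau2 = 100.0                                   # prior variance of each cluster mean

C = np.ones(K) / K                             # initial weights
Theta = rng.normal(0, 5, size=K)               # initial cluster means

for it in range(200):
    # 1. Sample assignments Z_i from the assignment probabilities a_ik.
    log_a = np.log(C) - 0.5 * (X[:, None] - Theta[None, :]) ** 2
    a = np.exp(log_a - log_a.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    Z = np.array([rng.choice(K, p=a[i]) for i in range(n)])

    # 2. Sample each Theta_k from its conjugate (Gaussian) posterior given cluster k's points.
    counts = np.bincount(Z, minlength=K)
    sums = np.array([X[Z == k].sum() for k in range(K)])
    post_var = 1.0 / (1.0 / tau2 + counts)
    Theta = rng.normal(post_var * sums, np.sqrt(post_var))

    # 3. Sample the weights from the Dirichlet posterior (parameters alpha*g_k + counts).
    C = rng.dirichlet(alpha * g + counts)

print(np.sort(Theta), np.round(C, 2))          # means near -5, 0, 6; weights near the true proportions
```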

138 COMPARISON: EM AND K-MEANS
The BMM Gibbs sampler looks very similar to the EM algorithm, with maximization steps (in EM) substituted by posterior sampling steps:
                     Representation of assignments         Parameters
EM                   assignment probabilities a_{i,1:K}    a_ik-weighted MLE
k-means              m_i = arg max_k (a_{i,1:K})           MLE for each cluster
Gibbs for BMM        m_i ~ Multinomial(a_{i,1:K})          sample posterior for each cluster
Advanced Machine Learning 137 / 212

139 TOOLS: THE DIRICHLET DISTRIBUTION

140 THE DIRICHLET DISTRIBUTION
Recall: Probability simplex
The set of all probability distributions on K events is the simplex
    △_K := { (c_1, ..., c_K) ∈ R^K | c_k ≥ 0 and Σ_k c_k = 1 }.
[Figure: the simplex △_3 with vertices e_1, e_2, e_3 and a point with coordinates c_1, c_2, c_3.]
Dirichlet distribution
The Dirichlet distribution is the distribution on △_K with density
    q_Dirichlet(c_{1:K} | α, g_{1:K}) := (1 / Z(α, g_{1:K})) exp( Σ_{k=1}^K (α g_k − 1) log c_k )
Parameters:
g_{1:K} ∈ △_K: mean parameter, i.e. E[c_{1:K}] = g_{1:K}. Note that g_{1:K} is itself a probability distribution on K categories.
α ∈ R_+: concentration. Larger α means sharper concentration around g_{1:K}.
Advanced Machine Learning 139 / 212
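A quick numerical check of this parametrization (assumed example; note that NumPy's Dirichlet sampler takes the standard parameter vector, which here is α·g): the sample mean stays at g, while larger α shrinks the spread around it.

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.array([1/3, 1/3, 1/3])                  # mean parameter on the simplex

for alpha in (0.8, 1.0, 1.8, 10.0):            # concentration values
    samples = rng.dirichlet(alpha * g, size=100_000)
    print(alpha, samples.mean(axis=0).round(3), samples.std(axis=0).round(3))
```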

141 THE DIRICHLET DISTRIBUTION
In all plots, g_{1:K} = (1/3, 1/3, 1/3). Light colors = large density values.
[Figure: density plots and heat maps on the simplex for α = 0.8 (large density values at the extreme points), α = 1 (uniform distribution on △_K), α = 1.8 (density peaks around its mean), α = 10 (the peak sharpens with increasing α).]
Advanced Machine Learning 140 / 212

142 MULTINOMIAL-DIRICHLET MODEL
Model
The Dirichlet is the natural conjugate prior on the multinomial parameters. If we observe h_k counts in category k, the posterior is
    Π(c_{1:K} | h_1, ..., h_K) = q_Dirichlet( c_{1:K} | α + n, ( (α g_1 + h_1) / (α + n), ..., (α g_K + h_K) / (α + n) ) )
where n = Σ_k h_k is the total number of observations.
Illustration: One observation
Suppose K = 3 and we obtain a single observation in category 3.
[Figure: prior with its mean at the center of the simplex; posterior with the mean shifted toward the extreme point corresponding to k = 3 and increased concentration.]
Advanced Machine Learning 141 / 212
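The update in code (numbers assumed for illustration), written in the same (concentration, mean) parametrization: the counts h are added to α·g and the result is renormalized, so the mean moves toward the empirical frequencies and the concentration grows from α to α + n.

```python
import numpy as np

alpha, g = 3.0, np.array([1/3, 1/3, 1/3])      # prior concentration and mean
h = np.array([0, 0, 1])                        # one observation in category 3
n = h.sum()

alpha_post = alpha + n                         # concentration increases by n
g_post = (alpha * g + h) / (alpha + n)         # new mean, still on the simplex
print(alpha_post, g_post.round(3))             # mean shifts toward e_3, concentration grows
```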
