Review. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012


1 Review Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

2 What is Machine Learning (ML)?
Study of algorithms that improve their performance at some task with experience.

3 Graphical Models
Representation: directed vs. undirected; conditional independence semantics; factorization
Inference: message passing algorithm (tree vs. general graph); junction tree for general graphs; variational inference vs. sampling
Learning: directed vs. undirected; fully observed vs. latent variable; structure learning

4 Conditional Independence Assumptions
BN: Local Markov Assumption: X ⊥ NonDescendants_X | Pa_X; global independencies are read off via d-separation (active trails). Example (Allergy, Flu → Sinus; Sinus → Headache, Nose): (A ⊥ H | S) and (N ⊥ H | S) hold, but at the v-structure A → S ← F we have A ⊥ F while (A ⊥ F | S) does not hold.
MN: Global Markov Assumption: A ⊥ B | C whenever sep_G(A, B; C). Derived local and pairwise assumptions: X ⊥ Rest | MB(X), and X ⊥ Y | Rest when there is no edge X–Y. Example: for X connected to A, B, C, D, MB(X) = {A, B, C, D}.

5 Distribution Factorization
Bayesian Networks (Directed Graphical Models): if G is an I-map of P, i.e. I_ℓ(G) ⊆ I(P), then P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | Pa_{X_i}), with conditional probability tables (CPTs) as the local factors.
Markov Networks (Undirected Graphical Models): for strictly positive P, if G is an I-map of P, i.e. I(G) ⊆ I(P), then P(X_1, ..., X_n) = (1/Z) ∏_{i=1}^m Ψ_i(D_i), with clique potentials Ψ_i over (maximal) cliques D_i and normalization (partition function) Z = ∑_{x_1,...,x_n} ∏_{i=1}^m Ψ_i(D_i).

6 Representation Power?
Can a distribution P always be converted between BN and MN representations?
BN: the minimal I-map is not unique, and a P-map does not always exist (e.g. the independencies X_1 ⊥ X_3 | {X_2, X_4} and X_2 ⊥ X_4 | {X_1, X_3} of a four-cycle MN have no BN P-map).
MN: the minimal I-map is unique, but a P-map still does not always exist (e.g. the v-structure A → S ← F, with A ⊥ F but not (A ⊥ F | S), has no MN P-map).

7 Inference in Graphical Models
General form of the inference problem: P(X_1, ..., X_n) ∝ ∏_i Ψ(D_i).
We want to query variables Y given evidence e, and we don't care about a set of variables Z.
Compute τ(Y, e) = ∑_Z ∏_i Ψ(D_i) using variable elimination.
Renormalize to obtain the conditional: P(Y | e) = τ(Y, e) / ∑_Y τ(Y, e).
Two examples (use the graph structure to order the computation): a chain A–B–C–D–E, and a DAG over A, ..., H.

8 Message passing algorithm
m_{ji}(X_i) = ∑_{X_j} Ψ(X_i, X_j) Ψ(X_j) ∏_{s ∈ N(j)\i} m_{sj}(X_j)
Take the product of the incoming messages, multiply by the local potentials, and sum out X_j.
Node j can send the message m_{ji}(X_i) to node i once the incoming messages from all neighbors in N(j)\i have arrived. (A small sketch follows.)
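To make the update concrete, here is a minimal sketch (my own toy example, not from the slides) of sum-product message passing on a chain of three binary variables; the potential tables and variable names are made up for illustration, and the result is checked against brute-force enumeration.

```python
import numpy as np

# Chain A - B - C of binary variables with shared pairwise potential psi(X_j, X_i)
# and unary potentials psi(X_j); we compute P(C) by passing messages A->B->C.
unary = {v: np.array([1.0, 2.0]) for v in "ABC"}   # psi(X_j), made-up values
pair = np.array([[3.0, 1.0],
                 [1.0, 3.0]])                      # psi(X_j, X_i), made-up values

# m_{A->B}(X_B) = sum_{X_A} psi(X_A, X_B) psi(X_A)
m_AB = pair.T @ unary["A"]
# m_{B->C}(X_C) = sum_{X_B} psi(X_B, X_C) psi(X_B) m_{A->B}(X_B)
m_BC = pair.T @ (unary["B"] * m_AB)

belief_C = unary["C"] * m_BC                       # unnormalized marginal of C
print(belief_C / belief_C.sum())

# Brute-force check by enumerating the full joint distribution
joint = np.einsum('a,b,c,ab,bc->abc',
                  unary["A"], unary["B"], unary["C"], pair, pair)
print(joint.sum(axis=(0, 1)) / joint.sum())
```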

9 Junction tree algorithm for DAG
Steps: moralize the DAG, triangulate the moral graph, build the clique graph, and take a maximum spanning tree over the clique graph (with separator sizes as weights) to obtain the junction tree.
In the slide's example on variables A–H, the cliques include {B,C}, {A,D,E}, {C,D,E}, {A,E,F}, {E,F,H}, {G,E}, with separators such as C, DE, AE, EF, E.

10 Message passing in junction trees
m_{D_j → D_i}(S_{ji}) = ∑_{D_j \ S_{ji}} Φ(D_j) ∏_{D_t ∈ N(D_j)\D_i} m_{D_t → D_j}(S_{tj})
Take the product of the incoming messages, multiply by the local potentials, and sum out the variables not in the separator, where the separator is S_{kj} = D_k ∩ D_j.
The same scheme can also be applied to loopy clique graphs for approximate inference.

11 Variational Inference
What approximating structure should Q have?
How do we measure the goodness of the approximation of Q(X_1, ..., X_n) to the original P(X_1, ..., X_n)? Use the reverse KL-divergence KL(Q || P).
How do we compute the new parameters? By optimization: Q* = argmin_Q KL(Q || P), e.g. mean field.

12 Mean Field Algorithm
Initialize Q(X_1, ..., X_n) = ∏_i Q_i(X_i) (e.g., randomly or smartly), and mark all variables as unprocessed.
Repeat: pick an unprocessed variable X_i; update Q_i(X_i) = (1/Z_i) exp( ∑_{D_j : X_i ∈ D_j} E_Q[ln Ψ(D_j)] ); mark X_i as processed; if Q_i changed, mark the neighbors of X_i as unprocessed.
Guaranteed to converge. (See the sketch below.)
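As an illustration of the update Q_i(X_i) ∝ exp(∑ E_Q[ln Ψ(D_j)]), here is a minimal sketch (my own toy example, not from the slides) of mean field coordinate ascent on a small chain-structured pairwise MRF over binary variables; the potentials are made up.

```python
import numpy as np

# Toy pairwise MRF on a chain of 4 binary variables (made-up potentials).
n = 4
unary = np.tile(np.array([1.0, 2.0]), (n, 1))      # psi_i(x_i)
pair = np.array([[2.0, 1.0],
                 [1.0, 2.0]])                      # psi_ij(x_i, x_j), shared
edges = [(i, i + 1) for i in range(n - 1)]

Q = np.full((n, 2), 0.5)                           # fully factored Q = prod_i Q_i
for sweep in range(100):                           # coordinate ascent sweeps
    for i in range(n):
        log_q = np.log(unary[i]).copy()
        for a, b in edges:                         # factors touching X_i
            if a == i:
                log_q += np.log(pair) @ Q[b]       # E_{Q_b}[ln psi(x_i, X_b)]
            elif b == i:
                log_q += np.log(pair).T @ Q[a]     # E_{Q_a}[ln psi(X_a, x_i)]
        q = np.exp(log_q - log_q.max())            # normalize (the 1/Z_i step)
        Q[i] = q / q.sum()

print(Q)   # approximate marginals Q_i(X_i)
```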

13 Why Sampling
Previous inference tasks focused on obtaining the entire posterior distribution P(X_i | e). Often we instead want expectations:
Mean: μ_{X_i | e} = E[X_i | e] = ∫ X_i P(X_i | e) dX_i
Variance: σ²_{X_i | e} = E[(X_i − μ_{X_i | e})² | e] = ∫ (X_i − μ_{X_i | e})² P(X_i | e) dX_i
More generally, E[f] = ∫ f(X) P(X | e) dX, which can be difficult to compute analytically.
Key idea: approximate the expectation by a sample average, E[f] ≈ (1/N) ∑_{i=1}^N f(x_i), where x_1, ..., x_N ~ P(X | e) independently and identically.
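A minimal numerical check of the sample-average idea; the target distribution (a unit Gaussian) and the function f are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)        # x_1, ..., x_N ~ P(X), here a unit Gaussian

f = lambda x: x ** 2                    # any function whose expectation we want
print(f(x).mean())                      # Monte Carlo estimate of E[f] = E[X^2] = 1
```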

14 Sampling Methods
Direct sampling: works only for easy distributions (multinomial, Gaussian, etc.)
Rejection sampling: create samples as in direct sampling, but only count samples consistent with the given evidence
Importance sampling: create samples as in direct sampling and assign weights to the samples
Gibbs sampling: often used for high-dimensional problems; sample each variable conditioned on its Markov blanket

15 Gibbs Sampling in formulas
Initialize X = x^0. For t = 1 to N:
x_1^t ~ P(X_1 | x_2^{t−1}, ..., x_K^{t−1})
x_2^t ~ P(X_2 | x_1^t, x_3^{t−1}, ..., x_K^{t−1})
...
x_K^t ~ P(X_K | x_1^t, ..., x_{K−1}^t)
For graphical models, we only need to condition on the variables in the Markov blanket.
Variants: randomly pick which variable to sample; sample block by block, e.g. (x_1^t, x_2^t) ~ P(X_1, X_2 | x_3^{t−1}, ..., x_K^{t−1}).
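A minimal sketch of the scheme for a two-dimensional Gaussian target (my own example, not from the slides): each full conditional of a bivariate Gaussian with correlation ρ is a univariate Gaussian, so the updates are easy to write down.

```python
import numpy as np

rho = 0.8                               # target is N(0, [[1, rho], [rho, 1]])
rng = np.random.default_rng(0)

x1, x2 = 0.0, 0.0                       # X = x^0
samples = []
for t in range(10_000):
    # x1^t ~ P(X1 | x2^{t-1}) = N(rho * x2, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
    # x2^t ~ P(X2 | x1^t) = N(rho * x1, 1 - rho^2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    samples.append((x1, x2))

samples = np.array(samples[1000:])      # drop burn-in
print(np.corrcoef(samples.T)[0, 1])     # should be close to rho
```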

16 Learning for GMs
Known structure, fully observed data: relatively easy. Unknown structure, fully observed data: hard.
Known structure, missing data: hard (EM). Unknown structure, missing data: very hard.
Estimation principles: maximum likelihood estimation; Bayesian estimation.
Common features: make use of the distribution factorization, of inference algorithms, and of regularization/priors.

17 Bayesian Parameter Estimation
Bayesians treat the unknown parameters as a random variable whose distribution can be inferred using Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ.
The crucial equation in words: posterior = likelihood × prior / marginal likelihood.
For iid coin-flip data, the likelihood is P(D | θ) = ∏_{i=1}^N P(x_i | θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1 − x_i} = θ^{#heads} (1 − θ)^{#tails}.
The prior P(θ) encodes our prior knowledge of the domain; different priors P(θ) end up with different estimates P(θ | D)!
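For this coin-flip likelihood a Beta(a, b) prior is conjugate, so the posterior is available in closed form. A minimal sketch follows; the data and prior hyperparameters are arbitrary choices of mine.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # coin flips, 1 = head
heads, tails = x.sum(), len(x) - x.sum()

a, b = 2.0, 2.0                               # Beta(a, b) prior on theta (assumed)
# Conjugacy: P(theta | D) = Beta(a + #heads, b + #tails)
post_a, post_b = a + heads, b + tails

theta_mle = heads / len(x)                    # maximum likelihood estimate
theta_map = (post_a - 1) / (post_a + post_b - 2)
theta_mean = post_a / (post_a + post_b)       # Bayesian posterior mean
print(theta_mle, theta_map, theta_mean)       # different priors give different estimates
```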

18 Frequentist Parameter Estimation
Bayesian estimation has been criticized for being subjective. Frequentists think of a parameter as a fixed, unknown constant, not a random variable.
Hence they use different "objective" estimators, instead of Bayes' rule. These estimators have different properties, such as being unbiased, minimum variance, etc.
A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties: θ̂ = argmax_θ P(D | θ) = argmax_θ ∏_{i=1}^N P(x_i | θ).

19 How should estimators be used?
θ_MAP is not Bayesian (even though it uses a prior), since it is a point estimate.
Consider predicting the future. A sensible way is to combine the predictions based on all possible values of θ, weighted by their posterior probability; this is called Bayesian prediction: P(x_new | D) = ∫ P(x_new, θ | D) dθ = ∫ P(x_new | θ, D) P(θ | D) dθ = ∫ P(x_new | θ) P(θ | D) dθ.
A frequentist prediction will typically use a plug-in estimator such as ML/MAP: P(x_new | D) ≈ P(x_new | θ_ML) or P(x_new | D) ≈ P(x_new | θ_MAP).

20 Decomposable likelihood of a directed model
l(θ; D) = log P(D | θ) = ∑_i log P(a_i | θ_a) + ∑_i log P(f_i | θ_f) + ∑_i log P(s_i | a_i, f_i, θ_s) + ∑_i log P(h_i | s_i, θ_h)
There is one term for each CPT, so the MLE problem breaks up into independent subproblems.
Because of the factorization of the distribution, we can estimate each CPT separately (in the Allergy/Flu → Sinus → Headache network, each CPT is learned separately).

21 Bayesian estimation for directed models
Factorization: P(X = x) = ∏_i P(x_i | pa_{X_i}, θ_i)
Local CPTs are multinomial distributions: P(X_i = k | Pa_{X_i} = j) = θ_{kj}
Use a factorized prior over the parameters: P(θ_a) P(θ_f) P(θ_s) P(θ_h), one factor per CPT in the Allergy/Flu → Sinus → Headache network.

22 MLE Learning Algorithm for Exponential-Family Models
max_θ l(θ; D) is a convex optimization problem. It can be solved by many methods, such as gradient descent or conjugate gradient.
Initialize the model parameters θ. Loop until convergence: compute the gradient ∂l(θ; D)/∂θ_ij = E_{P̂(X_i, X_j)}[X_i X_j] − E_{P(X | θ)}[X_i X_j], and update θ_ij ← θ_ij + η ∂l(θ; D)/∂θ_ij.

23 Partially observed graphical models
Mixture models and hidden Markov models

24 Why is learning hard?
In fully observed iid settings, the log-likelihood decomposes into a sum of local terms: l(θ; D) = log p(x, z | θ) = log P(z | θ_1) + log p(x | z, θ_2).
With latent variables, all the parameters become coupled together via marginalization: l(θ; D) = log ∑_z p(x, z | θ) = log ∑_z p(x | z, θ_2) P(z | θ_1).

25 EM algorithm
EM: expectation-maximization for finding θ that maximizes l(θ; D) = log ∑_z p(x, z | θ) = log ∑_z p(x | z, θ_2) P(z | θ_1).
Iterate between the E-step and the M-step until convergence:
Expectation step (E-step): f(θ) = E_{q(z)}[log p(x, z | θ)], where q(z) = P(z | x, θ^t)
Maximization step (M-step): θ^{t+1} = argmax_θ f(θ)
(A sketch for a two-component Gaussian mixture follows.)
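A minimal sketch of EM for a two-component univariate Gaussian mixture (my own toy example; the data generation and initialization are arbitrary). The E-step computes the responsibilities q(z) = P(z | x, θ^t), and the M-step re-estimates the mixture weights, means, and standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])   # toy data

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for it in range(100):
    # E-step: responsibilities q(z) = P(z | x, theta_t)
    dens = (np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
            / (sigma * np.sqrt(2 * np.pi)))
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: theta_{t+1} = argmax_theta E_q[log p(x, z | theta)]
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi, mu, sigma)   # should roughly recover the two generating components
```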

26 Structure Learning
The goal: given a set of independent samples (assignments of the random variables), find the best (most likely) graphical model structure.
Example: from samples such as (A,F,S,N,H) = (T,F,F,T,F), (T,F,T,T,F), (F,T,T,T,T), score candidate structures over {A, F, S, N, H}.
Scores: maximum likelihood; Bayesian score; margin.

27 Chow-Liu algorithm
The best tree structure is T* = argmax_T [ M ∑_{(i,j) ∈ T} I(X_i, X_j) − M ∑_i H(X_i) ], where M is the number of samples.
Chow-Liu algorithm: for each pair of variables X_i, X_j, compute their empirical mutual information I(X_i, X_j); this gives a complete graph over the variable nodes with edge weights I(X_i, X_j); run a maximum spanning tree algorithm. (See the sketch below.)
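A minimal sketch of the two steps on made-up binary data (my own example): estimate the pairwise empirical mutual information, then run a maximum spanning tree, here via a simple Prim-style greedy search over the complete graph.

```python
import numpy as np

def mutual_information(a, b):
    """Empirical mutual information of two binary variables from paired samples."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

rng = np.random.default_rng(0)
z = rng.integers(0, 2, 500)
X = np.column_stack([z,
                     z ^ (rng.random(500) < 0.1),    # noisy copy of column 0
                     rng.integers(0, 2, 500),        # independent variable
                     z ^ (rng.random(500) < 0.3)])   # weaker copy of column 0
n = X.shape[1]

I = np.zeros((n, n))                                 # pairwise MI matrix
for i in range(n):
    for j in range(i + 1, n):
        I[i, j] = I[j, i] = mutual_information(X[:, i], X[:, j])

# Maximum spanning tree over the complete graph with weights I (Prim's algorithm).
in_tree, edges = {0}, []
while len(in_tree) < n:
    i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
               key=lambda e: I[e])
    edges.append((i, j))
    in_tree.add(j)
print(edges)    # edges of the Chow-Liu tree
```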

28 Kernel methods
Kernels: similarity measures between pairs of data points; positive definite kernel matrices; design and combine kernels; fast kernel computation.
Kernelize algorithms: use inner products between data points to express the algorithm; the learned function is a linear combination of data points; replace the inner products with kernels. Examples: SVM, ridge regression, clustering, PCA, CCA, ICA, statistical tests.
Gaussian processes: covariance functions are kernel functions.

29 Support Vector Machines (SVM)
min_w (1/2) w⊤w + C ∑_j ξ_j
s.t. y_j (w⊤x_j + b) ≥ 1 − ξ_j, ξ_j ≥ 0, ∀j
where the ξ_j are slack variables.

30 SVM for nonlinear problems
Solve a nonlinear problem with a linear relation in feature space: transform the data points so that a nonlinear decision boundary in input space becomes a linear decision boundary in feature space.
The same idea gives nonlinear clustering, principal component analysis, and canonical correlation analysis.

31 SVM for nonlinear problems
Some problems need complicated, even infinite, feature maps, e.g. φ(x) = (x, x², x³, x⁴, ...).
Explicitly computing high-dimensional features is time consuming and makes the subsequent optimization costly.

32 Kernel trick
In the dual problem of the SVM, replace the inner product by a kernel:
max_α ∑_i α_i − (1/2) ∑_{i,j} α_i α_j y_i y_j φ(x_i)⊤φ(x_j), with φ(x_i)⊤φ(x_j) replaced by k(x_i, x_j)
s.t. ∑_i α_i y_i = 0, 0 ≤ α_i ≤ C
The corresponding kernel matrix is positive semidefinite. This is a quadratic program; solving for α gives w = ∑_j α_j y_j φ(x_j) and b = y_k − w⊤φ(x_k) for any k such that 0 < α_k < C.
Evaluate the decision boundary on a new data point: f(x) = w⊤φ(x) = (∑_j α_j y_j φ(x_j))⊤φ(x) = ∑_j α_j y_j k(x_j, x).

33 Typical kernels for vector data
Polynomial of degree d: k(x, y) = (x⊤y)^d
Polynomial of degree up to d: k(x, y) = (x⊤y + c)^d
Gaussian RBF kernel: k(x, y) = exp(−‖x − y‖² / (2σ²))
Laplace kernel: k(x, y) = exp(−‖x − y‖ / σ)
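A minimal sketch of these kernels as plain functions; the parameter values σ, c, d below are arbitrary.

```python
import numpy as np

def poly_kernel(x, y, d=3, c=1.0):
    """Polynomial kernel of degree up to d: (x'y + c)^d (c=0 gives exactly degree d)."""
    return (np.dot(x, y) + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplace_kernel(x, y, sigma=1.0):
    """Laplace kernel: exp(-||x - y|| / sigma)."""
    return np.exp(-np.linalg.norm(x - y) / sigma)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(poly_kernel(x, y), rbf_kernel(x, y), laplace_kernel(x, y))
```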

34 Kernel Functions
Denote the inner product as a function: k(x_i, x_j) = φ(x_i)⊤φ(x_j).
Kernels can also be defined on structured objects, e.g. graphs mapped to counts of substructures (# nodes, # edges, # triangles, # rectangles, # pentagons) or strings such as DNA sequences, giving similarity values like k(G, G') = 0.6 or k(s, s') = 0.7.

35 Combining kernels
Positive weighted combinations of kernels are kernels: if k_1(x, y) and k_2(x, y) are kernels and α, β ≥ 0, then k(x, y) = α k_1(x, y) + β k_2(x, y) is a kernel.
Products of kernels are kernels: if k_1(x, y) and k_2(x, y) are kernels, then k(x, y) = k_1(x, y) k_2(x, y) is a kernel.
Mappings between spaces give you kernels: if k(x, y) is a kernel, then k(φ(x), φ(y)) is a kernel, e.g. k(x, y) = x² y² from the linear kernel with φ(x) = x².

36 Principal component analysis
Given a set of M centered observations x_k ∈ R^d, PCA finds the direction that maximizes the variance. With X = [x_1, x_2, ..., x_M]:
w* = argmax_{‖w‖=1} (1/M) ∑_k (w⊤x_k)² = argmax_{‖w‖=1} (1/M) w⊤XX⊤w
With C = (1/M) XX⊤, w* can be found by solving the eigenvalue problem Cw = λw.

37 Alternative expression for PCA
The principal component lies in the span of the data: w = ∑_k α_k x_k = Xα.
Plugging this in, we have Cw = (1/M) XX⊤Xα = λXα.
Furthermore, for each data point x_k, the relation x_k⊤Cw = (1/M) x_k⊤XX⊤Xα = λ x_k⊤Xα holds for all k.
In matrix form: (1/M) X⊤XX⊤Xα = λ X⊤Xα, which only depends on the inner product matrix X⊤X.

38 Kernel PCA
Key idea: replace the inner product matrix by the kernel matrix.
PCA: (1/M) X⊤XX⊤Xα = λ X⊤Xα. Map x_k ↦ φ(x_k), let Φ = [φ(x_1), ..., φ(x_M)] and K = Φ⊤Φ; the nonlinear component is w = Φα.
Kernel PCA: (1/M) KKα = λKα, equivalent to (1/M) Kα = λα.
So: first form the M × M kernel matrix K, then perform an eigendecomposition of K. (A sketch follows.)
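A minimal sketch of kernel PCA with an RBF kernel (my own toy data; the kernel bandwidth and number of components are arbitrary). Centering of the kernel matrix in feature space is included explicitly, and the normalization of α is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                       # M = 100 points in R^2

# Kernel matrix K with a Gaussian RBF kernel
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# Center in feature space: K <- H K H with H = I - (1/M) 1 1^T
M = len(X)
H = np.eye(M) - np.ones((M, M)) / M
Kc = H @ K @ H

# Eigendecomposition: (1/M) K alpha = lambda alpha
eigval, eigvec = np.linalg.eigh(Kc / M)
order = np.argsort(eigval)[::-1]
alpha = eigvec[:, order[:2]]                            # top-2 coefficient vectors

# Projections of the training points onto the nonlinear components w = Phi alpha
projections = Kc @ alpha
print(projections.shape)                                # (100, 2)
```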

39 CCA in inner product form
As in PCA, the directions of projection lie in the span of the data: with X = (x_1, ..., x_m), Y = (y_1, ..., y_m), write w_x = Xα, w_y = Yβ, and C_xy = (1/m) XY⊤, C_xx = (1/m) XX⊤, C_yy = (1/m) YY⊤.
Plugging w_x = Xα, w_y = Yβ into the CCA objective from earlier gives
max_{α,β} (α⊤X⊤XY⊤Yβ) / ( sqrt(α⊤X⊤XX⊤Xα) · sqrt(β⊤Y⊤YY⊤Yβ) )
The data only appear through inner products.

40 Kernel CCA
Replace the inner product matrices by kernel matrices, where K_x is the kernel matrix for data X with entries K_x(i, j) = k(x_i, x_j), and similarly K_y:
max_{α,β} (α⊤K_x K_y β) / ( sqrt(α⊤K_x K_x α) · sqrt(β⊤K_y K_y β) )
This can be solved as a generalized eigenvalue problem.

41 Embedding with kernel features
Transform a distribution into an (infinite-dimensional) feature vector: a rich representation in feature space that captures the mean, variance, and higher-order moments.

42 Estimating embedding distance
Finite sample estimator: form a kernel matrix with 4 blocks (within-sample and between-sample kernel evaluations) and average each block.

43 Measure Dependence via Embeddings
Use the squared distance between embeddings in feature space to measure the dependence between X and Y [Smola, Gretton, Song and Scholkopf, 2007].
The dependence measure is useful for dimensionality reduction, clustering, and matching.

44 Estimating embedding distances
Given samples (x_1, y_1), ..., (x_m, y_m) ~ P(X, Y), the dependence measure can be expressed via inner products:
‖μ_XY − μ_X μ_Y‖² = ‖E_XY[φ(X) ⊗ ψ(Y)] − E_X[φ(X)] ⊗ E_Y[ψ(Y)]‖² = ⟨μ_XY, μ_XY⟩ − 2⟨μ_XY, μ_X μ_Y⟩ + ⟨μ_X μ_Y, μ_X μ_Y⟩
As a kernel matrix operation (with centering matrix H = I − (1/m) 1 1⊤, and the X and Y samples ordered in the same way): trace(H K_x H K_y), where K_x(i, j) = k(x_i, x_j) and K_y(i, j) = k(y_i, y_j). (A small sketch follows.)
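A minimal sketch of the empirical statistic trace(H K_x H K_y) with RBF kernels, comparing a dependent pair against an independent pair; the data, bandwidth, and the 1/m² normalization are my own choices for illustration.

```python
import numpy as np

def rbf_gram(Z, sigma=1.0):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def dependence(X, Y, sigma=1.0):
    """Empirical dependence measure trace(H Kx H Ky), scaled by 1/m^2 (assumed)."""
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(H @ rbf_gram(X, sigma) @ H @ rbf_gram(Y, sigma)) / m ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
Y_dep = np.sin(3 * X) + 0.1 * rng.standard_normal((200, 1))   # depends on X
Y_ind = rng.standard_normal((200, 1))                          # independent of X
print(dependence(X, Y_dep), dependence(X, Y_ind))   # dependent pair gives the larger value
```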

45 Other advanced methods
Combining classifiers: bagging, stacking, boosting (AdaBoost)
Semi-supervised learning: graph-based methods (label propagation), co-training, semi-supervised SVM
Active learning
Tensor data decomposition: PARAFAC and Tucker decompositions

46 What is a Gaussian Process?
A Gaussian process is a generalization of the multivariate Gaussian distribution to infinitely many variables.
Formally: a collection of random variables, any finite number of which have a (consistent) joint Gaussian distribution.
Informally: an infinitely long vector whose dimensions are indexed by x, i.e. a function f(x).
A Gaussian process is fully specified by a mean function m(x) = E[f(x)] and a covariance function k(x, x') = E[(f(x) − m(x))(f(x') − m(x'))]; we write f(x) ~ GP(m(x), k(x, x')), where the x are the indices.

47 Covariance function of Gaussian processes
For any finite collection of indices x_1, x_2, ..., x_n, the covariance matrix must be positive semidefinite: Σ = K = [k(x_i, x_j)]_{i,j=1}^n.
So the covariance function needs to be a kernel function over the indices, e.g. the Gaussian RBF kernel k(x, x') = exp(−(1/2)‖x − x'‖²).

48 Samples from GPs with different kernels
k(x_i, x_j) = v_0 exp( −(|x_i − x_j| / r)^α ) + v_1 + v_2 δ_ij
(The original slide shows sample functions drawn from GPs with different settings of these hyperparameters.)

49 Using Gaussian processes for nonlinear regression
We observe a dataset D = {(x_i, y_i)}_{i=1}^n.
The prior P(f) is a Gaussian process; as in the multivariate Gaussian case, the posterior over f is also a Gaussian process.
Bayes' rule: P(f | D) = P(D | f) P(f) / P(D).
Everything else about GPs follows from the basic rules of probability applied to multivariate Gaussians.

50 Noisy Observations
With y | x, f(x) ~ N(f, σ²_noise I) and Y = (y_1, ..., y_n)⊤, the posterior process is
f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x')), where
m_post(x) = k(x, X)⊤ (K + σ²_noise I)^{−1} Y
k_post(x, x') = k(x, x') − k(x, X)⊤ (K + σ²_noise I)^{−1} k(x', X)
(A small sketch follows.)
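A minimal sketch of the posterior mean and covariance formulas above, with an RBF covariance function and a made-up 1-D dataset; the lengthscale and noise level are arbitrary.

```python
import numpy as np

def k(A, B, ell=1.0):
    """Gaussian RBF covariance function on 1-D inputs."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)                       # training inputs
Y = np.sin(X) + 0.1 * rng.standard_normal(20)   # noisy observations
Xs = np.linspace(0, 5, 100)                     # test inputs

sigma_noise = 0.1
K = k(X, X) + sigma_noise ** 2 * np.eye(len(X))
Ks = k(Xs, X)                                   # k(x, X) for each test x

# m_post(x)     = k(x, X) (K + sigma^2 I)^{-1} Y
# k_post(x, x') = k(x, x') - k(x, X) (K + sigma^2 I)^{-1} k(x', X)^T
m_post = Ks @ np.linalg.solve(K, Y)
k_post = k(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)

print(m_post[:5])
print(np.sqrt(np.diag(k_post))[:5])             # posterior std. dev. at test points
```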

51 Relating GPs to class probabilities
Transform the continuous output of the Gaussian process to a value in [−1, 1] or [0, 1].
With binary outputs, the joint distribution of all variables in the model is no longer Gaussian.
The likelihood is also not Gaussian, so we need approximate inference to compute the posterior GP (Laplace approximation, sampling).

52 Kernel low-rank approximation
Incomplete Cholesky factorization of the n × n kernel matrix K gives R of size d × n with d ≪ n and K ≈ R⊤R.
The GP posterior can then be computed with R in place of K:
f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x'))
m_post(x) ≈ R_x⊤ (RR⊤ + σ²_noise I)^{−1} R Y
k_post(x, x') ≈ R_{xx'} − R_x⊤ (RR⊤ + σ²_noise I)^{−1} (RR⊤) R_{x'}

53 Incomplete Cholesky Decomposition
A few ingredients to understand:
Gram-Schmidt orthogonalization: given a set of vectors V = {v_1, v_2, ..., v_n}, find an orthonormal basis Q = {u_1, u_2, ..., u_n} with u_i⊤u_j = 0 for i ≠ j and u_i⊤u_i = 1.
QR decomposition: given the orthonormal basis Q, compute the projection of V onto Q: v_i = ∑_j r_ji u_j, R = [r_ji], V = QR.
Cholesky decomposition with pivots: truncate to V ≈ Q(:, 1:k) R(1:k, :).
Kernelization: V⊤V = R⊤Q⊤QR = R⊤R ≈ R(1:k, :)⊤ R(1:k, :), so K = Φ⊤Φ ≈ R(1:k, :)⊤ R(1:k, :).

54 Incomplete Cholesky decomposition: Matlab
Kernel entries can be computed on the fly; the computation is O(nd²), with O(nd) kernel evaluations.
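The Matlab listing on the original slide is not reproduced in this transcription. The following is a rough Python sketch of pivoted incomplete Cholesky written from the description on the previous slide; the kernel, tolerance, and test data are my own choices.

```python
import numpy as np

def incomplete_cholesky(X, kernel, tol=1e-6, max_rank=None):
    """Greedy pivoted incomplete Cholesky: returns R (d x n) with K ~= R.T @ R.
    Kernel entries are computed on the fly, one column per accepted pivot."""
    n = len(X)
    d = np.array([kernel(X[i], X[i]) for i in range(n)])   # residual diagonal
    rows = []
    while d.max() > tol and (max_rank is None or len(rows) < max_rank):
        i = int(np.argmax(d))                              # pivot with largest residual
        col = np.array([kernel(X[i], X[j]) for j in range(n)])
        for r in rows:                                     # subtract previous rows
            col = col - r[i] * r
        col = col / np.sqrt(d[i])
        rows.append(col)
        d = d - col ** 2
    return np.array(rows)

rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
X = np.random.default_rng(0).standard_normal((200, 3))
R = incomplete_cholesky(X, rbf, max_rank=30)
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(R.shape, np.abs(K - R.T @ R).max())                  # small approximation error
```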

55 Random features
What basis to use? e^{jω⊤(x − y)} can be replaced by cos(ω⊤(x − y)), since both k(x − y) and p(ω) are real functions, and cos(ω⊤(x − y)) = cos(ω⊤x) cos(ω⊤y) + sin(ω⊤x) sin(ω⊤y). So for each ω, use the feature [cos(ω⊤x), sin(ω⊤x)].
What randomness to use? Randomly draw ω from p(ω); e.g. for the Gaussian RBF kernel, ω is drawn from a Gaussian. (A small sketch follows.)
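A minimal sketch of random Fourier features for the Gaussian RBF kernel; the feature dimension and bandwidth are arbitrary. The feature map stacks [cos(ω⊤x), sin(ω⊤x)] with ω drawn from a Gaussian, and the inner product of features approximates the kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 2000                      # input dim, number of random frequencies
sigma = 1.0                         # RBF bandwidth: k(x, y) = exp(-||x-y||^2 / (2 sigma^2))

W = rng.standard_normal((D, d)) / sigma          # omega ~ N(0, sigma^{-2} I)

def features(x):
    """z(x) such that z(x).z(y) ~= k(x, y)."""
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = features(x) @ features(y)
print(exact, approx)                              # should be close
```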

56 Other advanced methods
Combining classifiers: bagging, stacking, boosting (AdaBoost)
Semi-supervised learning: graph-based methods (label propagation), co-training, semi-supervised SVM
Active learning: active learning for SVM
Tensor data analysis: PARAFAC and Tucker decompositions; connection with latent variable models

57 Bagging
Bagging: bootstrap aggregating.
Generate B bootstrap samples of the training data by uniform random sampling with replacement, and train a classifier or regression function on each bootstrap sample.
For classification: take a majority vote over the classification results. For regression: average the predicted values.

58 Stacking classifiers
Level-0 models are based on different learning models and use the original data (level-0 data).
Level-1 models are based on the results of the level-0 models (the level-1 data are the outputs of the level-0 models); the level-1 model is also called the generalizer.
If you have many models, you can stack them into deeper hierarchies.

59 Boosting
Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule.
A family of methods that produce a sequence of classifiers. Each classifier depends on the previous one and focuses on the previous one's errors: examples that are incorrectly predicted by the previous classifiers are chosen more often, or weighted more heavily, when estimating the new classifier.
Questions: how to choose the hardest examples? How to combine these classifiers?

60 AdaBoost flow chart
From the original training set, build data set 1, ..., data set T and train learner 1, ..., learner T in sequence: training instances that are wrongly predicted by learner 1 play a more important role in the training of learner 2, and so on. The final prediction is a weighted combination of the learners.

61 AdaBoost
(The original slide lists the AdaBoost algorithm; a sketch follows below.)
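Since the algorithm box on this slide did not survive transcription, here is a rough sketch of the standard AdaBoost weight update with axis-aligned decision stumps as the weak learner; the data and the brute-force stump search are my own toy choices.

```python
import numpy as np

def best_stump(X, y, w):
    """Weak learner: axis-aligned threshold stump minimizing weighted error."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, (j, thr, sign))
    return best

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)            # toy labels in {-1, +1}

w = np.full(len(X), 1 / len(X))                        # example weights
stumps, alphas = [], []
for t in range(20):
    err, (j, thr, sign) = best_stump(X, y, w)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))    # classifier weight
    pred = sign * np.where(X[:, j] > thr, 1, -1)
    w *= np.exp(-alpha * y * pred)                     # up-weight the mistakes
    w /= w.sum()
    stumps.append((j, thr, sign)); alphas.append(alpha)

F = sum(a * s * np.where(X[:, j] > t, 1, -1)
        for a, (j, t, s) in zip(alphas, stumps))       # weighted combination
print((np.sign(F) == y).mean())                        # training accuracy
```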

62 Graph-based methods
Idea: construct a graph with edges between very similar examples. Unlabeled data can help glue objects of the same class together.
Suppose there are just two labels, 0 and 1. Solve for labels f(x) for the unlabeled examples x to minimize:
Minimum cut: ∑_{e=(u,v)} |f(u) − f(v)|
Minimum soft-cut: ∑_{e=(u,v)} (f(u) − f(v))²
Spectral partitioning
Label propagation: each unlabeled node takes the average of its neighbors' labels. (A sketch follows.)
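A minimal sketch of label propagation by iterated neighbor averaging on a k-nearest-neighbor graph; the data, graph construction, and number of iterations are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs; only one labeled point per class (labels 0 and 1).
X = np.vstack([rng.normal(-2, 0.7, (50, 2)), rng.normal(2, 0.7, (50, 2))])
labeled = {0: 0.0, 50: 1.0}                      # index -> label

# k-nearest-neighbor similarity graph
dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
k = 5
W = np.zeros_like(dist)
for i in range(len(X)):
    for j in np.argsort(dist[i])[1:k + 1]:
        W[i, j] = W[j, i] = 1.0

# Label propagation: repeatedly set f(u) to the average of its neighbors' labels,
# clamping the labeled points.
f = np.full(len(X), 0.5)
for i, y in labeled.items():
    f[i] = y
for it in range(200):
    f = W @ f / W.sum(axis=1)
    for i, y in labeled.items():
        f[i] = y                                  # clamp labeled examples

pred = (f > 0.5).astype(int)
print(pred[:50].mean(), pred[50:].mean())         # ~0 for first blob, ~1 for second
```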

63 Passive Learning (Non-sequential Design)
The data source and the expert/oracle provide labeled data points to the learning algorithm (estimator); the algorithm outputs a classifier.

64 Active Learning (Sequential Design)
The learning algorithm interacts with the data source and the expert/oracle: it requests the label of a data point, receives that label, requests the label of another data point, receives it, and so on; finally the algorithm outputs a classifier.

65 Active Learning (Sequential Design)
As before, the algorithm sequentially requests labels from the expert/oracle and finally outputs a classifier.
How many label requests are required to learn? This is the label complexity.
