Review. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
1 Review Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
2 What is Machine Learning (ML)? The study of algorithms that improve their performance at some task with experience.
3 Graphical Models. Representation: directed vs. undirected; conditional independence semantics; factorization. Inference: message passing algorithm (tree vs. general graph); junction tree for general graphs; variational inference vs. sampling. Learning: directed vs. undirected; fully observed vs. latent variables; structure learning.
4 Conditional Independence Assumptions. Bayesian networks: local Markov assumption X ⊥ NonDescendants_X | Pa_X; further independencies are read off via d-separation (active trails). Markov networks: global Markov assumption A ⊥ B | C whenever sep_G(A; B | C); derived local assumption X ⊥ TheRest | MB(X) and pairwise assumption X ⊥ Y | TheRest whenever there is no edge X–Y. Examples on the Allergy–Flu–Sinus–Headache–Nose network: in the BN, A ⊥ F but not (A ⊥ F | S); in the MN, the Markov blanket of a node is its set of neighbors, e.g., MB(X) = {A, B, C, D}.
5 Distribution Factorization. Bayesian networks (directed graphical models): if G is an I-map for P, i.e., I_l(G) ⊆ I(P), then P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | Pa_{X_i}), with one conditional probability table (CPT) per variable. Markov networks (undirected graphical models): for strictly positive P, if G is an I-map for P, i.e., I(G) ⊆ I(P), then P(X_1, ..., X_n) = (1/Z) ∏_{i=1}^m Ψ_i(D_i), where the Ψ_i(D_i) are clique potentials over maximal cliques and Z = Σ_{x_1, ..., x_n} ∏_{i=1}^m Ψ_i(D_i) is the normalization (partition function).
6 Representation Power? Can we convert between BN and MN representations of the same distribution P? For a BN, the minimal I-map is not unique and a P-map does not always exist; for a MN, the minimal I-map is unique, but a P-map does not always exist either. Example: the four-cycle MN encoding X_1 ⊥ X_3 | {X_2, X_4} and X_2 ⊥ X_4 | {X_1, X_3} has no BN P-map, while the v-structure BN encoding A ⊥ F but not (A ⊥ F | S) has no MN P-map.
7 Inference in Graphical Models. General form of the inference problem: P(X_1, ..., X_n) ∝ ∏_i Ψ(D_i). We want to query a variable Y given evidence e, and we do not care about a set of variables Z. Compute τ(Y, e) = Σ_Z ∏_i Ψ(D_i) using variable elimination, then renormalize to obtain the conditional P(Y | e) = τ(Y, e) / Σ_Y τ(Y, e). Two examples (a DAG over A, ..., H and a chain A–B–C–D–E) show how the graph structure is used to order the computation.
8 Message passing algorithm. Message from node j to node i: m_{ji}(X_i) = Σ_{X_j} Ψ(X_i, X_j) Ψ(X_j) ∏_{k ∈ N(j)\i} m_{kj}(X_j): take the product of the incoming messages, multiply by the local potentials, and sum out X_j. Node j can send its message to i once the incoming messages from all neighbors in N(j)\i have arrived.
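To make the update concrete, here is a small added sketch (not part of the original slides) of sum-product message passing on a chain of four binary variables; the potentials are randomly generated, purely illustrative values.

```python
import numpy as np

# Minimal sum-product on a chain X1 - X2 - ... - Xn of binary variables.
# Unary potentials phi[i] and pairwise potentials psi[i] (between X_i and X_{i+1})
# are arbitrary illustrative choices.
n = 4
rng = np.random.default_rng(0)
phi = [rng.random(2) + 0.1 for _ in range(n)]
psi = [rng.random((2, 2)) + 0.1 for _ in range(n - 1)]

# Forward messages m_{i-1 -> i}(x_i) = sum_{x_{i-1}} phi(x_{i-1}) psi(x_{i-1}, x_i) m_{i-2 -> i-1}(x_{i-1})
fwd = [np.ones(2) for _ in range(n)]
for i in range(1, n):
    fwd[i] = psi[i - 1].T @ (phi[i - 1] * fwd[i - 1])

# Backward messages m_{i+1 -> i}(x_i) = sum_{x_{i+1}} phi(x_{i+1}) psi(x_i, x_{i+1}) m_{i+2 -> i+1}(x_{i+1})
bwd = [np.ones(2) for _ in range(n)]
for i in range(n - 2, -1, -1):
    bwd[i] = psi[i] @ (phi[i + 1] * bwd[i + 1])

# Node marginals: product of the local potential and both incoming messages, then normalize.
marginals = [phi[i] * fwd[i] * bwd[i] for i in range(n)]
marginals = [m / m.sum() for m in marginals]
print(marginals)
```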
9 Junction tree algorithm for a DAG. Moralize the graph, then triangulate it, then collect the maximal cliques (in the example: BC, CDE, ADE, AEF, EFH, GE) into a clique graph whose edges are labeled by separators (C, DE, AE, EF, E). A maximum spanning tree of the clique graph, with separator sizes as weights, gives the junction tree.
10 Message passing in junction trees. Message from clique D_j to clique D_i over the separator S_{ji}: m_{D_j→D_i}(S_{ji}) = Σ_{D_j \ S_{ji}} Φ_{D_j} ∏_{D_t ∈ N(D_j)\D_i} m_{D_t→D_j}(S_{tj}): take the product of the incoming messages, multiply by the local potential, and sum out the variables not in the separator, where the separator is S_{kj} = D_k ∩ D_j. The same updates can also be applied to loopy clique graphs for approximate inference.
11 Variational Inference. What approximating structure should Q have? How do we measure the goodness of the approximation of Q(X_1, ..., X_n) to the original P(X_1, ..., X_n)? Use the reverse KL divergence KL(Q ‖ P). How do we compute the new parameters? By optimization: Q* = argmin_Q KL(Q ‖ P), which leads to mean field.
12 Mean Field Algorithm. Initialize Q(X_1, ..., X_n) = ∏_i Q_i(X_i) (e.g., randomly or smartly) and mark all variables as unprocessed. Pick an unprocessed variable X_i and update Q_i(X_i) = (1/Z_i) exp( Σ_{D_j : X_i ∈ D_j} E_Q[ln Ψ(D_j)] ). Mark X_i as processed; if Q_i changed, mark the neighbors of X_i as unprocessed. The algorithm is guaranteed to converge.
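As an added illustration (not on the slide), for an Ising model p(x) ∝ exp(Σ_i h_i x_i + Σ_{i<j} J_ij x_i x_j) with x_i ∈ {−1, +1}, the mean-field update reduces to the fixed-point equation m_i = tanh(h_i + Σ_j J_ij m_j); the couplings and fields below are made-up values.

```python
import numpy as np

# Naive mean field for an Ising model. The factorized Q(x) = prod_i Q_i(x_i) has
# mean parameters m_i = E_Q[x_i] satisfying m_i = tanh(h_i + sum_j J_ij m_j).
rng = np.random.default_rng(0)
n = 5
J = rng.normal(scale=0.3, size=(n, n))
J = np.triu(J, 1); J = J + J.T            # symmetric couplings, zero diagonal (illustrative)
h = rng.normal(scale=0.5, size=n)          # external fields (illustrative)

m = np.zeros(n)                            # initialize mean parameters
for sweep in range(200):
    for i in range(n):                     # coordinate-wise updates, as in the algorithm above
        m[i] = np.tanh(h[i] + J[i] @ m)

q_plus = (1 + m) / 2                       # Q_i(x_i = +1)
print(q_plus)
```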
13 Why Sampling. The previous inference tasks focused on obtaining the entire posterior distribution P(X_i | e). Often we instead want expectations: the mean μ_{X_i|e} = E[X_i | e] = ∫ X_i P(X_i | e) dX_i, the variance σ²_{X_i|e} = E[(X_i − μ_{X_i|e})² | e] = ∫ (X_i − μ_{X_i|e})² P(X_i | e) dX_i, or more generally E[f] = ∫ f(X) P(X | e) dX, which can be difficult to compute analytically. Key idea: approximate the expectation by a sample average, E[f] ≈ (1/N) Σ_{i=1}^N f(x^i), where x^1, ..., x^N are drawn independently and identically from P(X | e).
14 Sampling Methods. Direct sampling: works only for easy distributions (multinomial, Gaussian, etc.). Rejection sampling: create samples as in direct sampling, but only count the samples consistent with the given evidence. Importance sampling: create samples as in direct sampling and assign weights to them. Gibbs sampling: often used for high-dimensional problems; each variable is sampled conditioned on its Markov blanket.
15 Gibbs Sampling in formulae. Initialize X = x^0. For t = 1 to N: x_1^t ~ P(X_1 | x_2^{t−1}, ..., x_K^{t−1}); x_2^t ~ P(X_2 | x_1^t, x_3^{t−1}, ..., x_K^{t−1}); ...; x_K^t ~ P(X_K | x_1^t, ..., x_{K−1}^t). For graphical models we only need to condition on the variables in the Markov blanket (e.g., X_2, X_3, X_4, X_5 for X_1). Variants: randomly pick the variable to sample, or sample block by block, e.g., (x_1^t, x_2^t) ~ P(X_1, X_2 | x_3^{t−1}, ..., x_K^{t−1}).
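A minimal added example (not from the slides): a Gibbs sampler for a bivariate Gaussian, where each full conditional is a univariate Gaussian in closed form; the correlation and chain length are illustrative choices.

```python
import numpy as np

# Gibbs sampling for (X1, X2) ~ N(0, [[1, rho], [rho, 1]]).
# Full conditionals: X1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for X2.
rng = np.random.default_rng(0)
rho, T = 0.8, 10000
x1, x2 = 0.0, 0.0
samples = np.empty((T, 2))
for t in range(T):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))   # sample X1 given the current x2
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))   # sample X2 given the new x1
    samples[t] = (x1, x2)

burn = samples[1000:]                                   # discard burn-in
print(burn.mean(axis=0), np.corrcoef(burn.T)[0, 1])     # roughly (0, 0) and roughly rho
```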
16 Learning for GMs. With fully observable data, learning is relatively easy when the structure is known and hard when it is unknown; with missing data, it is hard (EM) for known structure and very hard for unknown structure. Estimation principles: maximum likelihood estimation and Bayesian estimation. Common features: make use of the distribution's factorization, make use of inference algorithms, and make use of regularization/priors.
17 Bayesian Parameter Estimation. Bayesians treat the unknown parameters as a random variable whose distribution can be inferred using Bayes' rule: P(θ | D) = P(D | θ) P(θ) / P(D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ. In words: posterior = likelihood × prior / marginal likelihood. For iid coin-flip data the likelihood is P(D | θ) = ∏_{i=1}^N P(x_i | θ) = ∏_{i=1}^N θ^{x_i} (1 − θ)^{1−x_i} = θ^{#heads} (1 − θ)^{#tails}. The prior P(θ) encodes our prior knowledge of the domain; different priors P(θ) end up with different estimates P(θ | D).
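To make the coin-flip example concrete, here is an added sketch assuming a conjugate Beta(a, b) prior; the hyperparameters and data are made-up.

```python
import numpy as np

# Beta-Bernoulli: the prior theta ~ Beta(a, b) is conjugate to the coin-flip likelihood
# theta^{#heads} (1-theta)^{#tails}, so the posterior is Beta(a + #heads, b + #tails).
a, b = 2.0, 2.0
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])            # 1 = head, 0 = tail (illustrative)
heads, tails = int(data.sum()), int(len(data) - data.sum())

post_a, post_b = a + heads, b + tails                 # posterior Beta parameters
posterior_mean = post_a / (post_a + post_b)           # Bayesian point summary
theta_mle = heads / len(data)                         # maximum likelihood estimate
theta_map = (post_a - 1) / (post_a + post_b - 2)      # MAP estimate (mode of the posterior)
print(posterior_mean, theta_mle, theta_map)
```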
18 Frequentist Parameter Estimation. Bayesian estimation has been criticized for being subjective. Frequentists think of a parameter as a fixed, unknown constant, not a random variable, and hence use different, objective estimators instead of Bayes' rule. These estimators have different properties, such as being unbiased, minimum variance, etc. A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties: θ^ = argmax_θ P(D | θ) = argmax_θ ∏_{i=1}^N P(x_i | θ).
19 How should estimators be used? θ_MAP is not Bayesian (even though it uses a prior), since it is a point estimate. Consider predicting the future: a sensible way is to combine predictions based on all possible values of θ, weighted by their posterior probability; this is called Bayesian prediction: P(x_new | D) = ∫ P(x_new, θ | D) dθ = ∫ P(x_new | θ, D) P(θ | D) dθ = ∫ P(x_new | θ) P(θ | D) dθ. A frequentist prediction will typically use a plug-in estimator such as ML or MAP: P(x_new | D) = P(x_new | θ^_ML) or P(x_new | D) = P(x_new | θ^_MAP).
20 Decomposable likelihood of a directed model. l(θ; D) = log P(D | θ) = Σ_i log P(a_i | θ_a) + Σ_i log P(f_i | θ_f) + Σ_i log P(s_i | a_i, f_i, θ_s) + Σ_i log P(h_i | s_i, θ_h): one term for each CPT, which breaks the MLE problem into independent subproblems. Because of the factorization of the distribution, we can estimate each CPT separately (Allergy, Flu, Sinus, Headache are learned separately).
21 Bayesian estimation for directed models. Factorization: P(X = x) = ∏_i P(x_i | pa_{X_i}, θ_i). The local CPTs are multinomial distributions: P(X_i = k | Pa_{X_i} = j) = θ_{kj}. The prior over the parameters is also factorized, P(θ_a) P(θ_f) P(θ_s) P(θ_h), one factor per CPT of the Allergy–Flu–Sinus–Headache network.
22 MLE Learning Algorithm for Exponential Models. max_θ l(θ; D) is a convex optimization problem and can be solved by many methods, such as gradient descent or conjugate gradient. Initialize the model parameters θ, then loop until convergence: compute the gradient ∂l(θ; D)/∂θ_ij = E_{P̂(X_i, X_j)}[X_i X_j] − E_{P(X|θ)}[X_i X_j] (empirical expectation minus model expectation) and update θ_ij ← θ_ij + η ∂l(θ; D)/∂θ_ij.
23 Partially observed graphical models: mixture models and hidden Markov models.
24 Why is learning hard? In the fully observed iid setting, the log-likelihood decomposes into a sum of local terms: l(θ; D) = log p(x, z | θ) = log P(z | θ_1) + log p(x | z, θ_2). With latent variables, all the parameters become coupled together via marginalization: l(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1).
25 EM algorithm. EM (expectation-maximization) finds θ maximizing l(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1). Iterate between the E-step and the M-step until convergence. Expectation step (E-step): f(θ) = E_{q(z)}[log p(x, z | θ)], where q(z) = P(z | x, θ^t). Maximization step (M-step): θ^{t+1} = argmax_θ f(θ).
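An added sketch (not from the slides) of EM for a two-component univariate Gaussian mixture, where both steps have closed forms; the synthetic data and initialization are made-up.

```python
import numpy as np

# EM for a 1-D mixture of two Gaussians with unknown weights, means, and variances.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # synthetic data

pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for it in range(100):
    # E-step: responsibilities q(z) = P(z | x, theta_t)
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximize E_q[log p(x, z | theta)] in closed form
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)   # should roughly recover the mixing weights, means, and variances
```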
26 Structure Learning. The goal: given a set of independent samples (assignments of the random variables), find the best (most likely) graphical model structure. For example, given samples (A, F, S, N, H) = (T, F, F, T, F), (T, F, T, T, F), (F, T, T, T, T), ..., score candidate structures over A, F, S, N, H using maximum likelihood, a Bayesian score, or the margin.
27 Chow-Liu algorithm. The optimal tree is T* = argmax_T [ M Σ_{(i,j)∈T} I(x_i, x_j) − M Σ_i H(x_i) ]; since the entropy term does not depend on T, it suffices to maximize the sum of edge mutual informations. Chow-Liu algorithm: for each pair of variables X_i, X_j, compute their empirical mutual information I(x_i, x_j); now you have a complete graph connecting the variable nodes, with edge weights equal to I(x_i, x_j); run a maximum spanning tree algorithm.
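An added sketch of the algorithm on binary data (not from the slides): estimate the pairwise mutual informations and grow a maximum-weight spanning tree with Prim's method; the synthetic chain-structured data are made-up.

```python
import numpy as np

def empirical_mi(a, b):
    """Empirical mutual information between two binary columns a and b."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

rng = np.random.default_rng(0)
n, d = 1000, 5
X = np.zeros((n, d), dtype=int)                 # synthetic data generated from a chain
X[:, 0] = rng.integers(0, 2, n)
for j in range(1, d):
    flip = rng.random(n) < 0.2
    X[:, j] = np.where(flip, 1 - X[:, j - 1], X[:, j - 1])

W = np.zeros((d, d))                            # edge weights = pairwise mutual information
for i in range(d):
    for j in range(i + 1, d):
        W[i, j] = W[j, i] = empirical_mi(X[:, i], X[:, j])

# Prim's algorithm for the maximum spanning tree over the complete MI graph.
in_tree, edges = {0}, []
while len(in_tree) < d:
    i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
               key=lambda e: W[e])
    edges.append((i, j))
    in_tree.add(j)
print(edges)                                    # should recover the chain edges
```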
28 Kernel methods. Kernels: a similarity measure between a pair of data points; positive definite kernel matrix; designing and combining kernels; fast kernel computation. Kernelizing algorithms: use inner products between data points to express the algorithm; the learned function is a linear combination of data points; replace inner products with kernels. Examples: SVM, ridge regression, clustering, PCA, CCA, ICA, statistical tests. Gaussian processes: covariance functions are kernel functions.
29 Support Vector Machines (SVM). min_w (1/2) w⊤w + C Σ_j ξ_j subject to y_j (w⊤x_j + b) ≥ 1 − ξ_j and ξ_j ≥ 0 for all j, where the ξ_j are slack variables.
30 SVM for nonlinear problems. Solve a nonlinear problem with a linear relation in feature space: transform the data points so that a nonlinear decision boundary in the input space becomes a linear decision boundary in the feature space. The same idea gives nonlinear clustering, principal component analysis, and canonical correlation analysis.
31 SVM for nonlinear problems. Some problems need complicated or even infinite-dimensional features, e.g., φ(x) = (x, x², x³, x⁴, ...). Explicitly computing high-dimensional features is time consuming and makes the subsequent optimization costly.
32 Kernel trick. In the dual problem of the SVM, replace the inner product by a kernel: max_α Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j φ(x_i)⊤φ(x_j), subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C, where φ(x_i)⊤φ(x_j) is replaced by k(x_i, x_j); the corresponding kernel matrix is positive semidefinite. This is a quadratic program; solve for α, then w = Σ_j α_j y_j φ(x_j) and b = y_k − w⊤φ(x_k) for any k such that 0 < α_k < C. Evaluate the decision function on a new data point as f(x) = w⊤φ(x) = (Σ_j α_j y_j φ(x_j))⊤ φ(x) = Σ_j α_j y_j k(x_j, x).
33 Typical kernels for vector data. Polynomial of degree d: k(x, y) = (x⊤y)^d. Polynomial of degree up to d: k(x, y) = (x⊤y + c)^d. Gaussian RBF kernel: k(x, y) = exp(−‖x − y‖² / (2σ²)). Laplace kernel: k(x, y) = exp(−‖x − y‖ / (2σ²)).
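An added sketch (not from the slides) computing the polynomial and Gaussian RBF kernels on toy data and checking that the resulting Gram matrices are positive semidefinite; the data and bandwidth are illustrative choices.

```python
import numpy as np

def poly_kernel(X, Y, d=2, c=1.0):
    """Polynomial kernel k(x, y) = (x'y + c)^d for the rows of X and Y."""
    return (X @ Y.T + c) ** d

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # toy data (illustrative)
for K in (poly_kernel(X, X), rbf_kernel(X, X)):
    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-10)           # Gram matrices are positive semidefinite
```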
34 Kernel Functions. Denote the inner product as a function: k(x_i, x_j) = φ(x_i)⊤φ(x_j). The slide illustrates kernels on structured data: graph kernels that map a graph to counts of substructures (# nodes, # edges, # triangles, # rectangles, # pentagons) and compare two graphs through these counts (e.g., K(G, G′) = 0.6, 0.2, 0.5 for different pairs), and string kernels that compare DNA sequences (e.g., K(s, s′) = 0.7).
35 Combining kernels. Positive weighted combinations of kernels are kernels: if k_1(x, y) and k_2(x, y) are kernels and α, β ≥ 0, then k(x, y) = α k_1(x, y) + β k_2(x, y) is a kernel. Products of kernels are kernels: if k_1(x, y) and k_2(x, y) are kernels, then k(x, y) = k_1(x, y) k_2(x, y) is a kernel. Mappings between spaces give you kernels: if k(x, y) is a kernel, then k(φ(x), φ(y)) is a kernel, e.g., k(x, y) = x² y².
36 Principal component analysis. Given a set of M centered observations x_k ∈ R^d, PCA finds the direction that maximizes the variance. With X = [x_1, x_2, ..., x_M], w* = argmax_{‖w‖=1} (1/M) Σ_k (w⊤x_k)² = argmax_{‖w‖=1} (1/M) w⊤XX⊤w. With C = (1/M) XX⊤, w can be found by solving the eigenvalue problem Cw = λw.
37 Alternative expression for PCA. The principal component lies in the span of the data: w = Σ_k α_k x_k = Xα. Plugging this in gives Cw = (1/M) XX⊤Xα = λXα. Furthermore, for each data point x_k, the relation x_k⊤Cw = (1/M) x_k⊤XX⊤Xα = λ x_k⊤Xα holds for all k. In matrix form, (1/M) X⊤XX⊤Xα = λ X⊤Xα, which only depends on the inner product matrix.
38 Kernel PCA. Key idea: replace the inner product matrix by the kernel matrix. PCA: (1/M) X⊤XX⊤Xα = λX⊤Xα; substitute x_k → φ(x_k), Φ = [φ(x_1), ..., φ(x_M)], K = Φ⊤Φ, with nonlinear component w = Φα. Kernel PCA: (1/M) KKα = λKα, which is equivalent to (1/M) Kα = λα. First form the M × M kernel matrix K, then perform an eigendecomposition of K.
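An added sketch of the procedure (not from the slides), including the feature-space centering that the "centered observations" assumption hides; the data and kernel bandwidth are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # toy data (illustrative)

# Gaussian RBF kernel matrix K = Phi' Phi
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# Center in feature space: K_c = H K H with H = I - 11'/M
M = len(X)
H = np.eye(M) - np.ones((M, M)) / M
Kc = H @ K @ H

# Eigendecomposition of (1/M) K alpha = lambda alpha; keep the top two components
lam, alpha = np.linalg.eigh(Kc / M)
lam, alpha = lam[::-1], alpha[:, ::-1]
alpha = alpha[:, :2] / np.sqrt(np.maximum(lam[:2] * M, 1e-12))  # scale so that w'w = 1

projections = Kc @ alpha                                # nonlinear principal components
print(projections.shape)
```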
39 CCA in inner product format. Similar to PCA, the directions of projection lie in the span of the data: with X = (x_1, ..., x_m) and Y = (y_1, ..., y_m), w_x = Xα and w_y = Yβ, and C_xy = (1/m) XY⊤, C_xx = (1/m) XX⊤, C_yy = (1/m) YY⊤. Plugging w_x = Xα and w_y = Yβ into the earlier CCA objective gives max_{α,β} α⊤X⊤XY⊤Yβ / (√(α⊤X⊤XX⊤Xα) √(β⊤Y⊤YY⊤Yβ)): the data only appear in inner products.
40 Kernel CCA. Replace the inner product matrices by kernel matrices: max_{α,β} α⊤K_x K_y β / (√(α⊤K_x K_x α) √(β⊤K_y K_y β)), where K_x is the kernel matrix for the data X, with entries K_x(i, j) = k(x_i, x_j), and similarly for K_y. The solution is obtained by solving the corresponding generalized eigenvalue problem.
41 Embedding with kernel features. Transform a distribution into a (possibly infinite-dimensional) feature-space vector: a rich representation that captures the mean, variance, and higher-order moments.
42 Estimating embedding distance. Finite sample estimator: form a kernel matrix with 4 blocks (the two within-sample blocks and the two cross-sample blocks) and average the entries within each block; the block averages combine to give the estimate of the squared distance between the embeddings.
43 Measure Dependence via Embeddings. Use the squared feature-space distance between the embedding of the joint distribution and the embedding of the product of the marginals to measure the dependence between X and Y [Smola, Gretton, Song and Scholkopf, 2007]. The dependence measure is useful for dimensionality reduction, clustering, and matching.
44 Estimating embedding distances. Given samples (x_1, y_1), ..., (x_m, y_m) ~ P(X, Y), the dependence measure can be expressed in terms of inner products: ‖μ_XY − μ_X μ_Y‖² = ‖E_XY[φ(X)ψ(Y)] − E_X[φ(X)] E_Y[ψ(Y)]‖² = ⟨μ_XY, μ_XY⟩ − 2⟨μ_XY, μ_X μ_Y⟩ + ⟨μ_X μ_Y, μ_X μ_Y⟩. As a kernel matrix operation (with H = I − (1/m) 11⊤, and the X and Y data ordered in the same way) this becomes trace(H K_x H K_y), where [K_x]_ij = k(x_i, x_j) and [K_y]_ij = k(y_i, y_j).
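An added sketch (not from the slides) of the finite-sample estimator trace(H K_x H K_y) above, normalized by m², on made-up data with one dependent and one independent pair for comparison.

```python
import numpy as np

def rbf_gram(Z, sigma=1.0):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def dependence(X, Y, sigma=1.0):
    """Biased finite-sample dependence estimate trace(Kx H Ky H) / m^2."""
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf_gram(X, sigma) @ H @ rbf_gram(Y, sigma) @ H) / m ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_dep = np.sin(3 * X) + 0.1 * rng.normal(size=(200, 1))   # Y depends on X
Y_ind = rng.normal(size=(200, 1))                          # Y independent of X
print(dependence(X, Y_dep), dependence(X, Y_ind))          # first value clearly larger
```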
45 Other advanced methods. Combining classifiers: bagging, stacking, boosting (AdaBoost). Semisupervised learning: graph-based methods (label propagation), co-training, semisupervised SVM. Active learning. Tensor data decomposition: Parafac and Tucker decomposition.
46 What is a Gaussian Process? A Gaussian process is a generalization of the multivariate Gaussian distribution to infinitely many variables. Formally: a collection of random variables, any finite number of which have (consistent) joint Gaussian distributions. Informally: an infinitely long vector with dimensions indexed by x, i.e., a function f(x). A Gaussian process is fully specified by a mean function m(x) = E[f(x)] and a covariance function k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))]; we write f(x) ~ GP(m(x), k(x, x′)), with x the indices.
47 Covariance function of Gaussian processes. For any finite collection of indices x_1, x_2, ..., x_n, the covariance matrix Σ = K with entries K_ij = k(x_i, x_j) must be positive semidefinite, so the covariance function needs to be a kernel function over the indices, e.g., the Gaussian RBF kernel k(x, x′) = exp(−(1/2)‖x − x′‖²).
48 Samples from GPs with different kernels, e.g., k(x_i, x_j) = v_0 exp(−(|x_i − x_j| / r)^α) + v_1 + v_2 δ_ij.
49 Using Gaussian processes for nonlinear regression. We observe a dataset D = {(x_i, y_i)}_{i=1}^n. The prior P(f) is a Gaussian process and, just as with a multivariate Gaussian, the posterior over f is also a Gaussian process, by Bayes' rule P(f | D) = P(D | f) P(f) / P(D). Everything else about GPs follows the basic rules of probability applied to multivariate Gaussians.
50 Noisy observations. With y | x, f(x) ~ N(f, σ²_noise I) and Y = (y_1, ..., y_n), the posterior process is f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x′)) with m_post(x) = k(x, X) (K + σ²_noise I)^{-1} Y and k_post(x, x′) = k(x, x′) − k(x, X) (K + σ²_noise I)^{-1} k(X, x′).
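An added sketch (not from the slides) implementing the posterior mean and covariance above with a Gaussian RBF covariance function; the training data, noise level, and bandwidth are made-up.

```python
import numpy as np

def k(A, B, sigma=0.5):
    """Gaussian RBF covariance function over scalar inputs."""
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)                       # training inputs (illustrative)
Y = np.sin(X) + 0.1 * rng.normal(size=20)       # noisy observations
noise = 0.1 ** 2

Xs = np.linspace(0, 5, 100)                     # test inputs
Kxx = k(X, X) + noise * np.eye(len(X))
Ksx = k(Xs, X)

# Posterior mean and covariance of the noise-free function at the test inputs
m_post = Ksx @ np.linalg.solve(Kxx, Y)
k_post = k(Xs, Xs) - Ksx @ np.linalg.solve(Kxx, Ksx.T)
print(m_post[:5], np.sqrt(np.diag(k_post))[:5])
```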
51 Relating GPs to class probabilities. Transform the continuous output of the Gaussian process to a value in [−1, 1] or [0, 1]. With binary outputs, the joint distribution of all variables in the model is no longer Gaussian; the likelihood is also not Gaussian, so we need approximate inference to compute the posterior GP (Laplace approximation, sampling).
52 Kernel low-rank approximation. Incomplete Cholesky factorization approximates the n × n kernel matrix K by R⊤R, with R of size d × n and d ≪ n. Substituting the low-rank factor into the GP posterior gives f(x) | {(x_i, y_i)}_{i=1}^n ~ GP(m_post(x), k_post(x, x′)) with m_post(x) = R_x (RR⊤ + σ²_noise I)^{-1} R Y and k_post(x, x′) = R_{xx′} − R_x (RR⊤ + σ²_noise I)^{-1} (RR⊤) R_{x′}, where R_x denotes the low-rank feature of the test input.
53 Incomplete Cholesky Decomposition. We have a few things to understand. Gram-Schmidt orthogonalization: given a set of vectors V = {v_1, v_2, ..., v_n}, find a set of orthonormal basis vectors Q = {u_1, u_2, ..., u_n} with u_i⊤u_j = 0 for i ≠ j and u_i⊤u_i = 1. QR decomposition: given the orthonormal basis Q, compute the projection of V onto Q, v_i = Σ_j r_ji u_j, R = [r_ji], so V = QR. Cholesky decomposition with pivots: V ≈ Q(:, 1:k) R(1:k, :). Kernelization: V⊤V = R⊤Q⊤QR = R⊤R ≈ R(1:k, :)⊤ R(1:k, :), and likewise K = Φ⊤Φ ≈ R(1:k, :)⊤ R(1:k, :).
54 Incomplete Cholesky decomposition in Matlab (the code listing is not reproduced in this transcription). Kernel entries can be computed on the fly, so the full kernel matrix never needs to be stored; computation: O(nd²) (number of kernel evaluations).
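Since the Matlab listing is missing here, the following is a hedged Python sketch of pivoted incomplete Cholesky that evaluates kernel entries on the fly; the kernel, tolerance, and data are assumptions.

```python
import numpy as np

def incomplete_cholesky(kern, n, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky: returns R (d x n) with K approximately R.T @ R.
    kern(i, j) evaluates a single kernel entry on demand."""
    d = np.array([kern(i, i) for i in range(n)])         # residual diagonal of K
    R, max_rank = [], max_rank or n
    for _ in range(max_rank):
        i = int(np.argmax(d))
        if d[i] <= tol:                                   # remaining error below tolerance
            break
        col = np.array([kern(i, j) for j in range(n)])    # one column of K, computed on the fly
        for r in R:
            col -= r[i] * r                               # subtract contribution of previous pivots
        col /= np.sqrt(d[i])
        R.append(col)
        d -= col ** 2                                     # update the residual diagonal
    return np.array(R)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                             # toy data (illustrative)
kern = lambda i, j: float(np.exp(-np.sum((X[i] - X[j]) ** 2) / 2))
R = incomplete_cholesky(kern, len(X), tol=1e-4)
print(R.shape)                                            # rank d is typically much smaller than n
```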
55 Random features. What basis should we use? e^{jω⊤(x−y)} can be replaced by cos(ω⊤(x − y)), since both k(x − y) and p(ω) are real functions, and cos(ω⊤(x − y)) = cos(ω⊤x) cos(ω⊤y) + sin(ω⊤x) sin(ω⊤y); so for each ω, use the feature [cos(ω⊤x), sin(ω⊤x)]. What randomness should we use? Randomly draw ω from p(ω); e.g., for the Gaussian RBF kernel, ω is drawn from a Gaussian.
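An added sketch (not from the slides) of random Fourier features for the Gaussian RBF kernel: draw ω from a Gaussian and use the [cos(ω⊤x), sin(ω⊤x)] features described above; the dimensions and bandwidth are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, sigma = 200, 3, 500, 1.0
X = rng.normal(size=(n, d))                      # toy data (illustrative)

# For k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), draw omega ~ N(0, I / sigma^2).
Omega = rng.normal(scale=1.0 / sigma, size=(d, D))
Z = np.hstack([np.cos(X @ Omega), np.sin(X @ Omega)]) / np.sqrt(D)

K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
K_approx = Z @ Z.T                               # random-feature approximation of K
print(np.abs(K_exact - K_approx).max())          # small approximation error
```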
56 Other advanced methods. Combining classifiers: bagging, stacking, boosting (AdaBoost). Semi-supervised learning: graph-based methods (label propagation), co-training, semi-supervised SVM. Active learning: active learning for SVM. Tensor data analysis: Parafac and Tucker decomposition; connection with latent variable models.
57 Bagging. Bagging: bootstrap aggregating. Generate B bootstrap samples of the training data by uniformly random sampling with replacement from the original training set. Train a classifier or a regression function using each bootstrap sample. For classification: take a majority vote over the classification results; for regression: average the predicted values.
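An added sketch of bagging (not from the slides) using axis-aligned decision stumps as the base learner; the base learner, the number of bootstrap samples, and the synthetic data are assumptions.

```python
import numpy as np

def fit_stump(X, y):
    """Best single-feature threshold rule (decision stump); labels in {-1,+1}."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.mean(pred != y)
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def predict_stump(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] > thr, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)              # synthetic labels (illustrative)

B, n = 25, len(X)
stumps = []
for _ in range(B):                                       # B bootstrap samples, one stump each
    idx = rng.integers(0, n, n)                          # sampling with replacement
    stumps.append(fit_stump(X[idx], y[idx]))

votes = np.sum([predict_stump(s, X) for s in stumps], axis=0)
y_hat = np.where(votes >= 0, 1, -1)                      # majority vote
print(np.mean(y_hat == y))                               # training accuracy of the ensemble
```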
58 Stacking classifiers. Level-0 models are based on different learning models and use the original data (level-0 data). Level-1 models are based on the results of the level-0 models (the level-1 data are the outputs of the level-0 models); the level-1 model is also called the generalizer. If you have lots of models, you can stack them into deeper hierarchies.
59 Boosting. Boosting: a general method for converting rough rules of thumb into a highly accurate prediction rule. It is a family of methods which produce a sequence of classifiers, where each classifier depends on the previous one and focuses on the previous one's errors: examples that are incorrectly predicted by the previous classifiers are chosen more often or weighted more heavily when estimating the new classifier. Questions: how to choose the hardest examples, and how to combine these classifiers?
60 AdaBoost flow chart. The original training set is reweighted into data set 1, data set 2, ..., data set T, from which learner 1, learner 2, ..., learner T are trained; training instances that are wrongly predicted by learner 1 play more important roles in the training of learner 2, and the final prediction is a weighted combination of the learners.
61 AdaBoost (the algorithm details are not transcribed; a sketch follows below).
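Since the algorithm on this slide is not transcribed, the following is a hedged sketch of standard AdaBoost with weighted decision stumps; the base learner, data, and number of rounds are assumptions.

```python
import numpy as np

def fit_weighted_stump(X, y, w):
    """Decision stump minimizing the weighted error; labels in {-1,+1}."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def predict_stump(j, thr, sign, X):
    return sign * np.where(X[:, j] > thr, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0.2, 1, -1)       # synthetic labels (illustrative)

w = np.ones(len(X)) / len(X)                             # uniform initial weights
ensemble = []
for t in range(30):
    err, j, thr, sign = fit_weighted_stump(X, y, w)
    err = max(err, 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)                # weight of this classifier
    pred = predict_stump(j, thr, sign, X)
    w *= np.exp(-alpha * y * pred)                       # upweight misclassified examples
    w /= w.sum()
    ensemble.append((alpha, j, thr, sign))

F = sum(a * predict_stump(j, thr, s, X) for a, j, thr, s in ensemble)
print(np.mean(np.sign(F) == y))                          # accuracy of the weighted combination
```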
62 Graph-based methods. Idea: construct a graph with edges between very similar examples; unlabeled data can help glue the objects of the same class together. Suppose there are just two labels, 0 and 1. Solve for labels f(x) on the unlabeled examples x to minimize an objective such as: label propagation: each label is the average of its neighbors' labels; minimum cut: Σ_{e=(u,v)} |f(u) − f(v)|; minimum soft cut: Σ_{e=(u,v)} (f(u) − f(v))²; spectral partitioning.
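An added sketch of label propagation (not from the slides) on a k-nearest-neighbor graph: each unlabeled node repeatedly takes the average of its neighbors' values while the labeled nodes stay clamped; the data and graph construction are made-up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.7, (50, 2)), rng.normal(2, 0.7, (50, 2))])  # two clusters
y = np.full(100, np.nan)
y[0], y[50] = 0.0, 1.0                       # only one labeled example per class

# Build a symmetric k-nearest-neighbor adjacency matrix (k is an illustrative choice).
k = 5
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.zeros((100, 100))
for i in range(100):
    for j in np.argsort(D[i])[1:k + 1]:
        W[i, j] = W[j, i] = 1.0

f = np.where(np.isnan(y), 0.5, y)            # initialize unlabeled nodes at 0.5
labeled = ~np.isnan(y)
for _ in range(200):
    f = W @ f / W.sum(axis=1)                # each node takes the average of its neighbors
    f[labeled] = y[labeled]                  # clamp the labeled nodes

pred = (f > 0.5).astype(int)
print(pred[:5], pred[50:55])                 # cluster memberships recovered from two labels
```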
63 Passive Learning (Non-sequential Design). The data source feeds the learning algorithm (estimator) with data points labeled by an expert/oracle, and the algorithm outputs a classifier.
64 Active Learning (Sequential Design). The learning algorithm requests the label of a data point from the expert/oracle and receives that label, then requests the label of another data point, and so on; finally the algorithm outputs a classifier.
65 Active Learning (Sequential Design), continued. With the same request-and-label loop as on the previous slide, the key question is: how many label requests are required to learn? This quantity is the label complexity.