CPSC 540: Machine Learning
Mark Schmidt, University of British Columbia, Winter 2018

Last Time: Monte Carlo Methods

If we want to approximate expectations of random functions,
$$E[g(x)] = \sum_{x \in \mathcal{X}} g(x)p(x) \;\;\text{(discrete $x$)} \qquad \text{or} \qquad E[g(x)] = \int_{x \in \mathcal{X}} g(x)p(x)\,dx \;\;\text{(continuous $x$)},$$
the Monte Carlo estimate is
$$E[g(x)] \approx \frac{1}{n}\sum_{i=1}^n g(x^i),$$
where the $x^i$ are independent samples from $p(x)$. We can use this to approximate marginals,
$$p(x_j = c) \approx \frac{1}{n}\sum_{i=1}^n \mathbb{I}[x^i_j = c].$$
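
To make this concrete, here is a minimal Monte Carlo sketch (not from the slides; the 2-state chain and its probabilities are made up for illustration) that estimates a marginal by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state chain ("rain"=0, "sun"=1); the numbers are made up.
p1 = np.array([0.5, 0.5])              # initial probabilities p(x_1)
T = np.array([[0.6, 0.4],              # T[c_prev, c] = p(x_j = c | x_{j-1} = c_prev)
              [0.3, 0.7]])
d = 10                                 # chain length

def sample_chain():
    """Draw one length-d sample from the Markov chain."""
    x = [rng.choice(2, p=p1)]
    for _ in range(d - 1):
        x.append(rng.choice(2, p=T[x[-1]]))
    return x

# Monte Carlo estimate of p(x_d = 0): fraction of samples ending in state 0.
n = 100_000
samples = np.array([sample_chain()[-1] for _ in range(n)])
print("Monte Carlo estimate of p(x_d = 0):", np.mean(samples == 0))
```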

Exact Marginal Calculation

In typical settings Monte Carlo has sublinear convergence, like stochastic gradient. Its speed is affected by the variance of the samples: if all samples look the same, it converges quickly; if the samples look very different, it can be painfully slow. We can sometimes avoid Monte Carlo and compute univariate marginals exactly: for Markov chains with discrete or Gaussian probabilities. In the discrete case, we're given $p(x_1)$ and we can compute $p(x_2)$ using
$$p(x_2) = \underbrace{\sum_{x_1=1}^k p(x_2, x_1)}_{\text{marginalization rule}} = \underbrace{\sum_{x_1=1}^k p(x_2 \mid x_1)p(x_1)}_{\text{product rule}}.$$
We can repeat this calculation to obtain $p(x_3)$ and subsequent marginals.

Exact Marginal Calculation

The recursive marginal formula is called the Chapman-Kolmogorov (CK) equations:
$$p(x_j) = \sum_{x_{j-1}=1}^k p(x_j \mid x_{j-1})\,p(x_{j-1}).$$
Evaluating the CK equations for one value of $x_j$ costs $O(k)$, so computing $p(x_j)$ for all $k$ states costs $O(k^2)$. This can be written as a matrix-vector product with the $k \times k$ matrix of transition probabilities. With the CK equations, computing all marginals costs $O(dk^2)$.
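
Using the same made-up 2-state chain as above, a sketch of the CK recursion as repeated matrix-vector products:

```python
import numpy as np

p1 = np.array([0.5, 0.5])              # hypothetical initial distribution p(x_1)
T = np.array([[0.6, 0.4],              # T[c_prev, c] = p(x_j = c | x_{j-1} = c_prev)
              [0.3, 0.7]])
d = 10

# CK equations: each step is one O(k^2) matrix-vector product,
# so all d marginals cost O(d k^2) total.
marginals = [p1]
for _ in range(d - 1):
    marginals.append(marginals[-1] @ T)   # p(x_j) = p(x_{j-1}) T

for j, p in enumerate(marginals, start=1):
    print(f"p(x_{j}) = {p}")
```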

Continuous-State Markov Chains

The CK equations also apply if we have continuous states:
$$p(x_j) = \int_{x_{j-1}} p(x_j \mid x_{j-1})\,p(x_{j-1})\,dx_{j-1}.$$
Gaussian probabilities are an important special case: if $p(x_{j-1})$ and $p(x_j \mid x_{j-1})$ are Gaussian, then $p(x_j)$ is Gaussian, so we can write $p(x_j)$ in closed form in terms of its mean and variance. If the probabilities are non-Gaussian, we usually can't represent the distribution $p(x_j)$ in closed form, and we are stuck using Monte Carlo or other approximations.
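
For concreteness, here is the closed-form update in the linear-Gaussian special case (the linear mean $a x_{j-1} + b$ is an illustrative assumption, not something the slides specify): if $p(x_{j-1}) = \mathcal{N}(\mu, \sigma^2)$ and $p(x_j \mid x_{j-1}) = \mathcal{N}(a x_{j-1} + b, q^2)$, then the CK integral gives another Gaussian,
$$p(x_j) = \int \mathcal{N}(x_j \mid a x_{j-1} + b,\, q^2)\,\mathcal{N}(x_{j-1} \mid \mu,\, \sigma^2)\,dx_{j-1} = \mathcal{N}(x_j \mid a\mu + b,\; a^2\sigma^2 + q^2).$$
This is the prediction step used in Kalman filtering.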

Marginals in CS Grad Career

The CK equations can give all marginals $p(x_j = c)$ for the CS grad Markov chain. (In the omitted figure, each row $j$ is a year and each column $c$ is a state.)

Stationary Distribution

A stationary distribution of a homogeneous Markov chain is a vector $\pi$ satisfying
$$\pi(c) = \sum_{c'} p(x_j = c \mid x_{j-1} = c')\,\pi(c').$$
The probabilities don't change across time (it is also called the invariant distribution). Under certain conditions, the marginals converge to a stationary distribution: $p(x_j = c) \to \pi(c)$ as $j \to \infty$. If we fit a Markov chain to the rain example, we have $\pi(\text{rain}) = 0.41$. In the CS grad student example, we have $\pi(\text{dead}) = 1$. The stationary distribution is the basis for Google's PageRank algorithm.
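
As a sketch, a stationary distribution can be computed as a left eigenvector of the transition matrix with eigenvalue 1 (the 2-state "rain"/"sun" probabilities below are made up, not the ones fit on the slides):

```python
import numpy as np

# Hypothetical 2-state "rain"/"sun" chain; the numbers are made up.
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])

# A stationary distribution satisfies pi = pi T, i.e. pi is a left
# eigenvector of T with eigenvalue 1, normalized to sum to 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
print("stationary distribution:", pi)

# Equivalently, repeated CK updates converge to pi under the
# ergodicity conditions discussed below.
p = np.array([1.0, 0.0])
for _ in range(100):
    p = p @ T
print("after 100 CK steps:   ", p)
```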

State Transition Diagram

State transition diagrams are a common way to visualize homogeneous Markov chains: each node is a state, and each edge is a non-zero transition probability. If the graph has only $z$ edges, the cost of the CK equations is only $O(z)$.
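
A minimal sketch of exploiting this sparsity (the 4-state chain is made up; scipy's sparse matrices are one standard way to get the $O(z)$ cost per step):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 4-state chain where each state links to only a couple of others,
# so the transition matrix has z = 6 non-zeros instead of k^2 = 16.
rows = [0, 0, 1, 2, 2, 3]
cols = [1, 2, 3, 0, 3, 0]
vals = [0.5, 0.5, 1.0, 0.7, 0.3, 1.0]   # each row sums to 1
T = csr_matrix((vals, (rows, cols)), shape=(4, 4))

p = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(50):
    p = T.T @ p        # sparse CK step: p(x_j = c) = sum_prev T[prev, c] p[prev]
print(p)
```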

Application: PageRank

Web search before Google: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

Application: PageRank

Wikipedia's cartoon illustration of Google's PageRank, where a larger face means a higher rank: important webpages are linked from other important webpages, and a link is more meaningful if a webpage has few links. https://en.wikipedia.org/wiki/pagerank

Application: PageRank

Google's PageRank algorithm measures the importance of a website as its stationary probability in the "random surfer" Markov chain: with probability $\alpha$, the surfer clicks on a random link on the current webpage; otherwise, the surfer goes to a completely random webpage. To compute the stationary distribution, they use the power method: repeatedly apply the CK equations. The iterations are faster than $O(k^2)$ due to the sparsity of links, can be easily parallelized, and achieve a linear convergence rate. More recent work has shown that coordinate optimization can be faster.
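
A minimal power-method sketch under stated assumptions (a tiny made-up link graph, and a damping value of 0.85, which the slides don't specify):

```python
import numpy as np

# Made-up link graph: links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
k = 4
alpha = 0.85   # assumed damping factor; the slides don't give a value

# Random-surfer transition matrix: with probability alpha follow a
# uniformly random outgoing link, otherwise jump to a random page.
T = np.full((k, k), (1 - alpha) / k)
for i, outs in links.items():
    for j in outs:
        T[i, j] += alpha / len(outs)

# Power method: repeatedly apply the CK equations until pi stops changing.
pi = np.full(k, 1.0 / k)
for _ in range(100):
    new = pi @ T
    if np.abs(new - pi).sum() < 1e-10:
        break
    pi = new
print("PageRank:", pi)
```

(In practice the matrix is never formed densely; the sparsity of the link structure keeps each iteration cheap, as noted above.)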

Application: Game of Thrones

PageRank can be used in other applications. For example: who is the main character in the Game of Thrones books? http://qz.com/650796/mathematicians-mapped-out-every-game-of-thrones-relationship-to-find-the-main-character

Existence/Uniqueness of Stationary Distribution

Does a stationary distribution $\pi$ exist, and is it unique? A sufficient condition for existence and uniqueness is that all transition probabilities are positive: $p(x_j = c \mid x_{j-1} = c') > 0$ for all $c$ and $c'$. PageRank satisfies this by giving the surfer a nonzero probability of jumping to a completely random page. Weaker sufficient conditions for existence and uniqueness ("ergodic"):
1. Irreducible (the chain doesn't get stuck in part of the graph).
2. Aperiodic (the probability of returning to a state isn't restricted to fixed intervals).

Outline

1. Exact Marginals and PageRank
2. Viterbi Decoding

Decoding: Maximizing Joint Probability

Decoding in density models means finding the $x$ with the highest joint probability:
$$\mathop{\mathrm{argmax}}_{x_1,x_2,\dots,x_d}\; p(x_1, x_2, \dots, x_d).$$
For the CS grad student chain ($d = 60$) the decoding is "industry" for all years. The decoding often doesn't look like a typical sample, and it can change if you increase $d$. Decoding is easy for independent models: we can just optimize each $x_j$ independently. For example, with four variables we have
$$\max_{x_1,x_2,x_3,x_4} \{p(x_1)p(x_2)p(x_3)p(x_4)\} = \left(\max_{x_1} p(x_1)\right)\left(\max_{x_2} p(x_2)\right)\left(\max_{x_3} p(x_3)\right)\left(\max_{x_4} p(x_4)\right).$$
Can we also maximize the marginals to decode a Markov chain?

Example of Decoding vs. Maximizing Marginals

Consider the "plane of doom" 2-variable Markov chain, with samples
$$X = \begin{bmatrix} \text{land} & \text{alive} \\ \text{land} & \text{alive} \\ \text{crash} & \text{dead} \\ \text{explode} & \text{dead} \\ \text{crash} & \text{dead} \\ \text{land} & \text{alive} \\ \vdots & \vdots \end{bmatrix}.$$
Initial probabilities are given by $p(x_1 = \text{land}) = 0.4$, $p(x_1 = \text{crash}) = 0.3$, $p(x_1 = \text{explode}) = 0.3$, and $x_2$ is "alive" iff $x_1$ is "land".

Example of Decoding vs. Maximizing Marginals

Initial probabilities are $p(x_1 = \text{land}) = 0.4$, $p(x_1 = \text{crash}) = 0.3$, $p(x_1 = \text{explode}) = 0.3$, and $x_2$ is "alive" iff $x_1$ is "land". If we apply the CK equations we get $p(x_2 = \text{alive}) = 0.4$ and $p(x_2 = \text{dead}) = 0.6$, so maximizing the marginals $p(x_j)$ independently gives ("land", "dead"), which actually has probability 0. Decoding instead considers the joint assignment to $x_1$ and $x_2$ maximizing probability; in this case it's ("land", "alive"), which has probability 0.4.

Distributing Max across Product

Note that decoding can't be done forward in time as in the CK equations: even if $p(x_1 = 1) = 0.99$, the most likely sequence could have $x_1 = 2$. Naively, we need to optimize over all $k^d$ assignments to the variables. Fortunately, the Markov property makes the max simplify:
$$\begin{aligned}
\max_{x_1,x_2,x_3,x_4} p(x_1, x_2, x_3, x_4)
&= \max_{x_1,x_2,x_3,x_4} p(x_4 \mid x_3)\,p(x_3 \mid x_2)\,p(x_2 \mid x_1)\,p(x_1)\\
&= \max_{x_4}\max_{x_3}\max_{x_2}\max_{x_1} p(x_4 \mid x_3)\,p(x_3 \mid x_2)\,p(x_2 \mid x_1)\,p(x_1)\\
&= \max_{x_4}\max_{x_3}\max_{x_2} p(x_4 \mid x_3)\,p(x_3 \mid x_2)\max_{x_1} p(x_2 \mid x_1)\,p(x_1)\\
&= \max_{x_4}\max_{x_3} p(x_4 \mid x_3)\max_{x_2} p(x_3 \mid x_2)\max_{x_1} p(x_2 \mid x_1)\,p(x_1),
\end{aligned}$$
where we're using that $\max_i \alpha a_i = \alpha \max_i a_i$ for non-negative $\alpha$.

Decoding with Memoization

The Markov property lets us write decoding as a sequence of max problems:
$$\max_{x_1,x_2,x_3,x_4} p(x_1, x_2, x_3, x_4) = \max_{x_4}\max_{x_3} p(x_4 \mid x_3)\max_{x_2} p(x_3 \mid x_2)\max_{x_1} p(x_2 \mid x_1)\,p(x_1),$$
but note that we can't just solve the $\max_{x_1}$ problem once, because it's a function of $x_2$. Instead, we memoize the solution $M_2(x_2) = \max_{x_1} p(x_2 \mid x_1)p(x_1)$ for all $x_2$,
$$\max_{x_1,x_2,x_3,x_4} p(x_1, x_2, x_3, x_4) = \max_{x_4}\max_{x_3} p(x_4 \mid x_3)\max_{x_2} p(x_3 \mid x_2)\,M_2(x_2).$$
Next we memoize the solution $M_3(x_3) = \max_{x_2} p(x_3 \mid x_2)M_2(x_2)$ for all $x_3$,
$$\max_{x_1,x_2,x_3,x_4} p(x_1, x_2, x_3, x_4) = \max_{x_4}\max_{x_3} p(x_4 \mid x_3)\,M_3(x_3).$$
And defining $M_4(x_4) = \max_{x_3} p(x_4 \mid x_3)M_3(x_3)$, the maximum value is given by
$$\max_{x_1,x_2,x_3,x_4} p(x_1, x_2, x_3, x_4) = \max_{x_4} M_4(x_4).$$

Example: Decoding the Plane of Doom

We have $M_1(x_1) = p(x_1)$, so in the plane of doom we have $M_1(\text{land}) = 0.4$, $M_1(\text{crash}) = 0.3$, $M_1(\text{explode}) = 0.3$. We have $M_2(x_2) = \max_{x_1} p(x_2 \mid x_1)M_1(x_1)$, so we get $M_2(\text{alive}) = 0.4$ and $M_2(\text{dead}) = 0.3$. Note that $M_2(\text{dead}) \neq p(x_2 = \text{dead})$ because we needed to choose either "crash" or "explode". We maximize $M_2(x_2)$ to find that the optimal decoding ends with "alive", then backtrack to find the state that led to "alive", giving "land".

Viterbi Decoding

What is $M_j(x_j)$ in words? It is the probability of the most likely length-$j$ sequence ending in $x_j$ (ignoring the future). The Viterbi decoding algorithm (a special case of dynamic programming):
1. Set $M_1(x_1) = p(x_1)$ for all $x_1$.
2. Compute $M_2(x_2)$ for all $x_2$, storing the value of $x_1$ leading to the best value of each $x_2$.
3. Compute $M_3(x_3)$ for all $x_3$, storing the value of $x_2$ leading to the best value of each $x_3$.
4. ...
5. Maximize $M_d(x_d)$ to find the value of $x_d$ in a decoding.
6. Backtrack to find the value of $x_{d-1}$ that led to this $x_d$.
7. Backtrack to find the value of $x_{d-2}$ that led to this $x_{d-1}$.
8. ...
Computing all $M_j(x_j)$ given all $M_{j-1}(x_{j-1})$ costs $O(k^2)$, so the total cost is only $O(dk^2)$ to search over all $k^d$ paths; see the sketch below.
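
A minimal Viterbi sketch in Python (written for this transcript, not taken from the slides), applied to the plane-of-doom example; the deterministic transition table encodes "$x_2$ is alive iff $x_1$ is land":

```python
import numpy as np

def viterbi(p1, T):
    """Most probable path for a Markov chain with initial distribution p1
    and transition matrices T, where T[j][a, b] = p(x_{j+2} = b | x_{j+1} = a)."""
    with np.errstate(divide="ignore"):        # allow log(0) = -inf
        M = [np.log(p1)]                      # log-probs for numerical stability
        logT = [np.log(Tj) for Tj in T]
    back = []
    for lt in logT:
        scores = M[-1][:, None] + lt          # scores[a, b] for each transition
        back.append(np.argmax(scores, axis=0))  # best predecessor of each state b
        M.append(np.max(scores, axis=0))      # M_j(b) in log space
    path = [int(np.argmax(M[-1]))]            # best final state, then backtrack
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1], float(np.exp(np.max(M[-1])))

# Plane of doom: x1 in {land, crash, explode}, x2 in {alive, dead}.
p1 = np.array([0.4, 0.3, 0.3])
T12 = np.array([[1.0, 0.0],    # land    -> alive
                [0.0, 1.0],    # crash   -> dead
                [0.0, 1.0]])   # explode -> dead
path, prob = viterbi(p1, [T12])
print(path, prob)   # [0, 0], i.e. ("land", "alive") with probability 0.4
```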

Application: Voice Photoshop

Adobe VoCo uses Viterbi decoding as part of synthesizing voices: https://www.youtube.com/watch?v=i3l4xlz59iw http://gfx.cs.princeton.edu/pubs/jin_2017_vti/jin2017-voco-paper.pdf

Summary

The Chapman-Kolmogorov equations compute exact univariate marginals for Markov chains with discrete or Gaussian probabilities. The stationary distribution of a homogeneous Markov chain describes the marginals as time goes to infinity, and is the basis of Google's PageRank method. Decoding is the task of finding the most probable $x$, and Viterbi decoding allows efficient decoding with Markov chains. Next time: weakening the Markov assumption.