CPSC 540: Machine Learning
Mark Schmidt, University of British Columbia, Winter 2019

Last Time: Monte Carlo Methods
If we want to approximate expectations of random functions,
E[g(x)] = \sum_{x \in \mathcal{X}} g(x) p(x) (discrete x) or E[g(x)] = \int_{x \in \mathcal{X}} g(x) p(x) dx (continuous x),
the Monte Carlo estimate is
E[g(x)] \approx \frac{1}{n} \sum_{i=1}^n g(x^i),
where the x^i are independent samples from p(x). We can use this to approximate marginals,
p(x_j = c) \approx \frac{1}{n} \sum_{i=1}^n I[x^i_j = c].
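
As a minimal sketch of this idea (not from the slides; the 2-state chain below is assumed purely for illustration), we can estimate a Markov chain marginal by ancestral sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
p1 = np.array([0.5, 0.5])               # assumed initial probabilities p(x_1)
P = np.array([[0.9, 0.1],               # assumed transition matrix:
              [0.3, 0.7]])              # P[a, b] = p(x_j = b | x_{j-1} = a)
d, n = 10, 100_000

# Draw n independent length-d chains by ancestral sampling.
samples = np.empty((n, d), dtype=int)
samples[:, 0] = rng.choice(2, size=n, p=p1)
for j in range(1, d):
    u = rng.random(n)
    # Next state is 1 whenever u exceeds p(x_j = 0 | x_{j-1}).
    samples[:, j] = (u > P[samples[:, j - 1], 0]).astype(int)

# Monte Carlo estimate of the marginal p(x_d = 0).
print(np.mean(samples[:, -1] == 0))
```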

Exact Marginal Calculation
In typical settings Monte Carlo has sublinear convergence, like stochastic gradient:
O(1/t) convergence rate, where the constant is the variance of the samples.
If all samples look the same, it converges quickly.
If samples look very different, it can be painfully slow.
For discrete-state Markov chains, we can actually compute marginals directly:
We're given initial probabilities p(x_1 = s) for all s as part of the definition.
We can use transition probabilities to compute p(x_2 = s) for all s:
p(x_2) = \sum_{x_1=1}^k p(x_2, x_1) (marginalization rule) = \sum_{x_1=1}^k p(x_2 | x_1) p(x_1) (product rule).
We can repeat this calculation to obtain p(x_3 = s) and subsequent marginals.

Exact Marginal Calculation
Recursive formula for marginals at time j:
p(x_j) = \sum_{x_{j-1}=1}^k p(x_j | x_{j-1}) p(x_{j-1}),
called the Chapman-Kolmogorov (CK) equations.
Cost: Given the previous time, the CK equation for one x_j costs O(k).
Given the previous time, computing p(x_j) for all k states costs O(k^2).
Can be written as a matrix-vector product with the k-by-k transition probability matrix.
So the cost to compute marginals up to time d is O(dk^2).
An example of dynamic programming: efficiently sums over k^d paths.
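
For comparison, a sketch of the exact calculation on the same assumed 2-state chain as above; each CK step is one matrix-vector product with the transition matrix:

```python
import numpy as np

p1 = np.array([0.5, 0.5])               # same assumed chain as above
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
d = 10

# Chapman-Kolmogorov: p(x_j) = sum over x_{j-1} of p(x_j | x_{j-1}) p(x_{j-1}).
# Each step is a k x k matrix-vector product, so all marginals cost O(d k^2).
p = p1.copy()
for _ in range(d - 1):
    p = p @ P
print(p)                                # exact p(x_d), no sampling error
```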

Marginals in CS Grad Career
The CK equations can give all marginals p(x_j = c) for the CS grad Markov chain:
each row is a state c and each column is a year j.

Continuous-State Markov Chains
The CK equations also apply if we have continuous states:
p(x_j) = \int_{x_{j-1}} p(x_j | x_{j-1}) p(x_{j-1}) dx_{j-1},
but this integral may not have a closed-form solution.
Gaussian probabilities are an important special case:
If p(x_{j-1}) and p(x_j | x_{j-1}) are Gaussian, then p(x_j) is Gaussian.
So we can write p(x_j) in closed form in terms of its mean and variance.
If the probabilities are non-Gaussian, we usually can't represent the p(x_j) distribution,
and you are stuck using Monte Carlo or other approximations.
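
As a worked instance of the Gaussian special case (a standard identity; the linear-Gaussian transition here is an assumption for illustration): if p(x_{j-1}) = \mathcal{N}(\mu, \sigma^2) and p(x_j | x_{j-1}) = \mathcal{N}(a x_{j-1} + b, q), then the CK integral has the closed form
p(x_j) = \int \mathcal{N}(x_j | a x_{j-1} + b, q) \, \mathcal{N}(x_{j-1} | \mu, \sigma^2) \, dx_{j-1} = \mathcal{N}(x_j | a\mu + b, a^2\sigma^2 + q),
so each CK step just updates a mean and a variance (this is the prediction step used by Kalman filters).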

Stationary Distribution
A stationary distribution of a homogeneous Markov chain is a vector \pi satisfying
\pi(c) = \sum_{c'} p(x_j = c | x_{j-1} = c') \pi(c').
The probabilities don't change across time (also called the invariant distribution).
Under certain conditions, marginals converge to a stationary distribution:
p(x_j = c) \to \pi(c) as j goes to \infty.
If we fit a Markov chain to the rain example, we have \pi("rain") = 0.41.
In the CS grad student example, we have \pi("dead") = 1.
The stationary distribution is the basis for Google's PageRank algorithm.
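
A minimal sketch of computing a stationary distribution directly (the 2-state "rain"/"not rain" numbers are invented for illustration, not the course's fitted values): \pi is a left eigenvector of the transition matrix with eigenvalue 1.

```python
import numpy as np

# Hypothetical 2-state chain over ("rain", "not rain"); numbers are made up.
P = np.array([[0.65, 0.35],
              [0.25, 0.75]])

# pi is stationary iff pi = pi @ P, i.e. a left eigenvector with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()                      # normalize to a probability vector
print(pi)                               # marginals p(x_j) converge to this
```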

Application: PageRank
Web search before Google: http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
It was also easy to fool search engines by copying popular websites.

State Transition Diagram
State transition diagrams are common for visualizing homogeneous Markov chains:
P = [ 0    0    0.2  0.8
      0    0    0    1
      0.2  0    0    0.8
      0    0.45 0.45 0.1 ]
Each node is a state; each edge is a non-zero transition probability.
For web search, each node will be a webpage.
The cost of the CK equations is only O(z) if you have only z edges.
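
A sketch of that O(z) claim (an illustration, not from the slides): storing the diagram's transition matrix sparsely makes each CK step cost proportional to the number of stored edges.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Transition matrix from the diagram above (rows: current state).
P = csr_matrix(np.array([
    [0.0, 0.0,  0.2,  0.8],
    [0.0, 0.0,  0.0,  1.0],
    [0.2, 0.0,  0.0,  0.8],
    [0.0, 0.45, 0.45, 0.1],
]))

p = np.array([1.0, 0.0, 0.0, 0.0])      # assumed: start in state 1
for _ in range(5):
    p = P.T @ p                         # one CK step; O(z) for z stored edges
print(p)
```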

Application: PageRank
Wikipedia's cartoon illustration of Google's PageRank: a larger face means a higher rank.
Important webpages are linked from other important webpages.
A link is more meaningful if it comes from a webpage with few links.
https://en.wikipedia.org/wiki/pagerank

Application: PageRank
Google's PageRank algorithm for measuring the importance of a website:
the stationary probability in the "random surfer" Markov chain:
With probability \alpha, the surfer clicks on a random link on the current webpage.
Otherwise, the surfer goes to a completely random webpage.
To compute the stationary distribution, they use the power method:
Repeatedly apply the CK equations.
Iterations are faster than O(k^2) due to sparsity of links.
Can be easily parallelized.
Achieves a linear convergence rate.
More recent works have shown coordinate optimization can be faster.
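
A minimal power-method sketch along these lines (not Google's implementation; the 4-page graph and the damping value are assumptions for illustration):

```python
import numpy as np

def pagerank(links, alpha=0.85, iters=100):
    """Power method for the random-surfer stationary distribution (a sketch).

    links[i] is the list of pages that page i links to.
    """
    k = len(links)
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        new = np.full(k, (1 - alpha) / k)        # teleport to a random page
        for i, out in enumerate(links):
            if out:                              # follow a uniform random link
                new[np.array(out)] += alpha * pi[i] / len(out)
            else:                                # dangling page: spread uniformly
                new += alpha * pi[i] / k
        pi = new
    return pi

# Hypothetical 4-page web graph for illustration.
print(pagerank([[1, 2], [2], [0], [0, 1, 2]]))
```

A real implementation stores the links sparsely so each sweep costs O(z) for z links, which is the sparsity speedup mentioned above.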

Application: Game of Thrones
PageRank can be used in other applications. Who is the main character in the Game of Thrones books?
http://qz.com/650796/mathematicians-mapped-out-every-game-of-thrones-relationship-to-find-the-main-character

Existence/Uniqueness of Stationary Distribution
Does a stationary distribution \pi exist, and is it unique?
A sufficient condition for existence/uniqueness is that all p(x_j = c | x_{j-1} = c') > 0.
PageRank satisfies this by adding a probability (1 - \alpha) of jumping to a random page.
Weaker sufficient conditions for existence and uniqueness ("ergodic"):
1. Irreducible (doesn't get stuck in part of the graph).
2. Aperiodic (probability of returning to a state isn't on fixed intervals).

Decoding: Maximizing Joint Probability
Decoding in density models: finding the x with the highest joint probability:
\argmax_{x_1, x_2, \ldots, x_d} p(x_1, x_2, \ldots, x_d).
For the CS grad student chain (d = 60), the decoding is "industry" for all years.
The decoding often doesn't look like a typical sample.
The decoding can change if you increase d.
Decoding is easy for independent models: we can just optimize each x_j independently.
For example, with four variables we have
\max_{x_1,x_2,x_3,x_4} \{ p(x_1) p(x_2) p(x_3) p(x_4) \} = (\max_{x_1} p(x_1)) (\max_{x_2} p(x_2)) (\max_{x_3} p(x_3)) (\max_{x_4} p(x_4)).
Can we also maximize the marginals to decode a Markov chain?

Example of Decoding vs. Maximizing Marginals
Consider the "plane of doom" 2-variable Markov chain, with samples
X = [ land     alive
      land     alive
      crash    dead
      explode  dead
      crash    dead
      land     alive
      ...      ...   ]
40% of the time the plane lands and you live.
30% of the time the plane crashes and you die.
30% of the time the plane explodes and you die.

Example of Decoding vs. Maximizing Marginals
Initial probabilities are given by
p(x_1 = "land") = 0.4, p(x_1 = "crash") = 0.3, p(x_1 = "explode") = 0.3,
and x_2 is "alive" iff x_1 is "land".
If we apply the CK equations we get
p(x_2 = "alive") = 0.4, p(x_2 = "dead") = 0.6,
so maximizing the marginals p(x_j) independently gives ("land", "dead").
This joint assignment actually has probability 0.
Decoding considers the joint assignment to x_1 and x_2 maximizing probability.
In this case it's ("land", "alive"), which has probability 0.4.
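
The whole example fits in a few lines; a sketch comparing the two answers (state names and numbers are exactly those above):

```python
import numpy as np

s1 = ["land", "crash", "explode"]
s2 = ["alive", "dead"]
p1 = np.array([0.4, 0.3, 0.3])            # p(x_1)
T = np.array([[1.0, 0.0],                 # p(x_2 | x_1): "alive" iff "land"
              [0.0, 1.0],
              [0.0, 1.0]])

p2 = p1 @ T                               # CK equation gives p(x_2)
print(s1[np.argmax(p1)], s2[np.argmax(p2)])   # land dead (probability 0!)

joint = p1[:, None] * T                   # p(x_1, x_2) for all pairs
i, j = np.unravel_index(np.argmax(joint), joint.shape)
print(s1[i], s2[j])                       # land alive (probability 0.4)
```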

Distributing Max across Product
Note that decoding can't be done forward in time as in the CK equations.
We need to optimize over all k^d assignments to all variables:
even if p(x_1 = 1) = 0.99, the most likely sequence could have x_1 = 2.
Fortunately, the Markov property makes the max simplify:
\max_{x_1,x_2,x_3,x_4} p(x_1,x_2,x_3,x_4)
= \max_{x_1,x_2,x_3,x_4} p(x_4|x_3) p(x_3|x_2) p(x_2|x_1) p(x_1)
= \max_{x_4} \max_{x_3} \max_{x_2} \max_{x_1} p(x_4|x_3) p(x_3|x_2) p(x_2|x_1) p(x_1)
= \max_{x_4} \max_{x_3} \max_{x_2} p(x_4|x_3) p(x_3|x_2) \max_{x_1} p(x_2|x_1) p(x_1)
= \max_{x_4} \max_{x_3} p(x_4|x_3) \max_{x_2} p(x_3|x_2) \max_{x_1} p(x_2|x_1) p(x_1),
where we're using that \max_i \alpha a_i = \alpha \max_i a_i for non-negative \alpha.

Decoding with Memoization
The Markov property writes decoding as a sequence of max problems:
\max_{x_1,x_2,x_3,x_4} p(x_1,x_2,x_3,x_4) = \max_{x_4} \max_{x_3} p(x_4|x_3) \max_{x_2} p(x_3|x_2) \max_{x_1} p(x_2|x_1) p(x_1),
but note that we can't just solve the \max_{x_1} problem once because it's a function of x_2.
Instead, we memoize the solution M_2(x_2) = \max_{x_1} p(x_2|x_1) p(x_1) for all x_2:
\max_{x_1,x_2,x_3,x_4} p(x_1,x_2,x_3,x_4) = \max_{x_4} \max_{x_3} p(x_4|x_3) \max_{x_2} p(x_3|x_2) M_2(x_2).
Next we memoize the solution M_3(x_3) = \max_{x_2} p(x_3|x_2) M_2(x_2) for all x_3:
\max_{x_1,x_2,x_3,x_4} p(x_1,x_2,x_3,x_4) = \max_{x_4} \max_{x_3} p(x_4|x_3) M_3(x_3).
And defining M_4(x_4) = \max_{x_3} p(x_4|x_3) M_3(x_3), the maximum value is given by
\max_{x_1,x_2,x_3,x_4} p(x_1,x_2,x_3,x_4) = \max_{x_4} M_4(x_4).

Example: Decoding the Plane of Doom
We have M_1(x_1) = p(x_1), so in the plane of doom we have
M_1("land") = 0.4, M_1("crash") = 0.3, M_1("explode") = 0.3.
We have M_2(x_2) = \max_{x_1} p(x_2|x_1) M_1(x_1), so we get
M_2("alive") = 0.4, M_2("dead") = 0.3.
Note that M_2("dead") \neq p(x_2 = "dead") because we needed to choose either "crash" or "explode".
We maximize M_2(x_2) to find that the optimal decoding ends with "alive".
We now need to backtrack to find the state that led to "alive", giving "land".

Viterbi Decoding
What is M_j(x_j) in words?
The probability of the most likely length-j sequence that ends in x_j (ignoring the future).
The Viterbi decoding algorithm (a special case of dynamic programming):
1. Set M_1(x_1) = p(x_1) for all x_1.
2. Compute M_2(x_2) for all x_2, storing the value of x_1 leading to the best value of each x_2.
3. Compute M_3(x_3) for all x_3, storing the value of x_2 leading to the best value of each x_3.
4. ...
5. Maximize M_d(x_d) to find the value of x_d in a decoding.
6. Backtrack to find the value of x_{d-1} that led to this x_d.
7. Backtrack to find the value of x_{d-2} that led to this x_{d-1}.
8. ...
Computing all M_j(x_j) given all M_{j-1}(x_{j-1}) costs O(k^2).
The total cost is only O(dk^2) to search over all k^d paths.
Has numerous applications, like decoding digital TV.
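
A sketch of the algorithm in code (not the course's implementation); it works in log-space to avoid underflow for long chains, and allows a different transition matrix at each step:

```python
import numpy as np

def viterbi(p1, transitions):
    """Most probable path in a discrete Markov chain (a sketch).

    p1: initial probabilities, shape (k,).
    transitions[j][a, b] = p(next state = b | current state = a) at step j.
    """
    with np.errstate(divide="ignore"):        # log(0) = -inf is fine here
        M = np.log(p1)                        # log M_1(x_1)
        back = []
        for T in transitions:
            scores = M[:, None] + np.log(T)   # candidates for the next log M
            back.append(np.argmax(scores, axis=0))  # best predecessor per state
            M = scores.max(axis=0)
    path = [int(np.argmax(M))]                # maximize M_d(x_d)...
    for b in reversed(back):                  # ...then backtrack
        path.append(int(b[path[-1]]))
    return path[::-1]

# Plane of doom: states (land, crash, explode) then (alive, dead).
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(viterbi(np.array([0.4, 0.3, 0.3]), [T]))   # [0, 0] -> land, alive
```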

Application: Voice Photoshop
Adobe VoCo uses Viterbi as part of synthesizing voices:
https://www.youtube.com/watch?v=i3l4xlz59iw
http://gfx.cs.princeton.edu/pubs/jin_2017_vti/jin2017-voco-paper.pdf

Summary
Chapman-Kolmogorov equations compute exact univariate marginals, for discrete or Gaussian Markov chains.
Stationary distribution of a homogeneous Markov chain: the marginals as time goes to \infty, and the basis of Google's PageRank method.
Decoding is the task of finding the most probable x.
Viterbi decoding allows efficient decoding with Markov chains.
Next time: measuring defence in the NBA.

Label Propagation as a Markov Chain Problem
The basic label propagation method has a Markov chain interpretation.
We have n + t states, one for each [un]labeled example.
Monte Carlo approach to label propagation ("adsorption"):
At time t = 0, set the state to the node you want to label.
At time t > 0, on a labeled node, output the label (labeled nodes are absorbing states).
At time t > 0, on an unlabeled node i, move to neighbour j with probability proportional to w_{ij}.
Final predictions are the probabilities of outputting each label.
Nice if you only need to label one example at a time (slow if labels are rare).
A common hack is to limit the random walk time to bound the runtime.
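
A Monte Carlo sketch of this adsorption procedure (the graph, weights, and walk cap below are all invented for illustration):

```python
import numpy as np

def adsorption_label(W, labels, start, n_walks=1000, max_steps=100, seed=0):
    """Monte Carlo label propagation ("adsorption") sketch.

    W: similarity matrix over all n + t examples.
    labels: dict mapping labeled node -> class; labeled nodes absorb.
    start: the unlabeled node we want to label.
    max_steps bounds the walk length (the runtime hack from the slides).
    """
    rng = np.random.default_rng(seed)
    counts = {}
    for _ in range(n_walks):
        i = start
        for _ in range(max_steps):
            if i in labels:                  # absorbing labeled state
                counts[labels[i]] = counts.get(labels[i], 0) + 1
                break
            p = W[i] / W[i].sum()            # move to j w.p. proportional to w_ij
            i = rng.choice(len(p), p=p)
    total = sum(counts.values())
    return {c: k / total for c, k in counts.items()}

# Tiny hypothetical graph: nodes 0, 1 labeled; node 2 unlabeled.
W = np.array([[0.0, 0.1, 1.0],
              [0.1, 0.0, 0.2],
              [1.0, 0.2, 0.0]])
print(adsorption_label(W, {0: "A", 1: "B"}, start=2))
```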