Course 495: Advanced Statistical Machine Learning/Pattern Recognition


1 Course 495: Advanced Statistical Machine Learning/Pattern Recognition
Lecturer: Stefanos Zafeiriou
Goal (Lectures): To present discrete- and continuous-valued probabilistic linear dynamical systems (HMMs & Kalman Filters).
Goal (Tutorials): To give students the mathematical tools needed to understand the models in depth.

2 Materials
Chapter 13: Pattern Recognition & Machine Learning, Christopher M. Bishop.
Chapter 17: Machine Learning: A Probabilistic Perspective, Kevin Murphy.
Rabiner, Lawrence. "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE 77.2 (1989).

3 Linear Dynamical Systems
Applications of probabilistic linear dynamical systems:
Language modelling
Object/Face tracking
Speech/Gesture recognition
Finance
Bioinformatics

4 Applications: Object/target tracking

5 Applications: Speech recognition (Google voice search)
[Figure: waveform of the utterance "Hello world"]

6 Applications: Gesture recognition (Kinect games)
[Figure: gestures]


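7 Latent Variable Models (Static)
General concept: the observations $x_1, x_2, x_3, \ldots, x_N$ are generated from latent variables $y_1, y_2, y_3, \ldots, y_N$ and share a common linear structure.
We want to find the parameters $\theta$ by joint likelihood maximization:
$p(x_1, \ldots, x_N, y_1, \ldots, y_N \mid \theta) = \prod_{i=1}^{N} p(x_i \mid y_i, W, \mu, \sigma) \prod_{i=1}^{N} p(y_i)$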

8 Latent Variable Models (Dynamic, Continuous)

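9 Latent Variable Models (Dynamic, Continuous)
Latent states $y_1, y_2, y_3, \ldots, y_T$; observations $x_1, x_2, x_3, \ldots, x_T$.
Generative model:
$x_n = W y_n + e_n$
$y_1 = \mu + u$
$y_n = A y_{n-1} + v_n$
Noise distributions:
$e \sim N(e \mid 0, \Sigma)$, $u \sim N(u \mid 0, P)$, $v \sim N(v \mid 0, \Gamma)$
Parameters: $\theta = \{W, A, \mu, \Sigma, \Gamma, P\}$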
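The generative model on this slide can be simulated directly. The sketch below is a scalar special case, assuming one-dimensional states and observations so that W, A and the covariances reduce to plain numbers; the function name `simulate_lds` and its arguments are illustrative, not part of the course material.

```python
import random

def simulate_lds(T, w, a, mu, sigma_e, sigma_u, sigma_v, seed=0):
    """Sample (ys, xs) from a scalar linear dynamical system.

    y_1 = mu + u,  y_n = a * y_{n-1} + v_n,  x_n = w * y_n + e_n,
    with zero-mean Gaussian noise. The sigma_* arguments are standard
    deviations (an assumption of this sketch; the slide uses covariances).
    """
    rng = random.Random(seed)
    ys, xs = [], []
    for n in range(T):
        if n == 0:
            y = mu + rng.gauss(0.0, sigma_u)          # initial latent state
        else:
            y = a * ys[-1] + rng.gauss(0.0, sigma_v)  # latent dynamics
        ys.append(y)
        xs.append(w * y + rng.gauss(0.0, sigma_e))    # noisy observation
    return ys, xs

# Noise-free run: the latent state simply decays geometrically.
ys, xs = simulate_lds(T=3, w=2.0, a=0.5, mu=4.0,
                      sigma_e=0.0, sigma_u=0.0, sigma_v=0.0)
```

With all noise switched off, the latent sequence reduces to the deterministic recursion, which is a convenient sanity check before adding noise.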

10 Latent Variable Models (Dynamic, Continuous)
Latent states $y_1, y_2, y_3, \ldots, y_T$; observations $x_1, x_2, x_3, \ldots, x_T$.
Markov property: $p(y_i \mid y_1, \ldots, y_{i-1}) = p(y_i \mid y_{i-1})$

11 Latent Variable Models (Dynamic, Discrete)
Word: need
Phonemes: n iy d
Latent states $y_1, \ldots, y_T$ emit the observations $x_1, \ldots, x_T$.
The latent structure takes discrete values: $y_t \in \{\text{start}, \text{n}, \text{iy}, \text{d}, \text{end}\}$

12 Summary: what will we study?
Sequential data (2 weeks): latent states $y_1, \ldots, y_T$ with observations $x_1, \ldots, x_T$.
What are the models? Markov & Hidden Markov Models (1 week); the Kalman Filter (1 week).
What will we learn? How to formulate the problems probabilistically and how to learn the parameters.

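13 Markov Chains with Discrete Random Variables
Chain: $x_1 \to x_2 \to x_3 \to \cdots \to x_T$
Let's assume we have discrete random variables, e.g. taking 3 discrete values in one-hot coding:
$x_t \in \{(1, 0, 0)^T, (0, 1, 0)^T, (0, 0, 1)^T\}$
Markov property: $p(x_t \mid x_1, \ldots, x_{t-1}) = p(x_t \mid x_{t-1})$, e.g. the probability that $x_t$ is one of these vectors given that $x_{t-1}$ is another.
The chain is stationary (homogeneous, time-invariant) if the distribution $p(x_t \mid x_{t-1})$ does not depend on $t$.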

14 Markov Chains with Discrete Random Variables
First order, $p(x_t \mid x_{t-1})$: bigram model
Second order, $p(x_t \mid x_{t-1}, x_{t-2})$: tri-gram model
Third order, $p(x_t \mid x_{t-1}, x_{t-2}, x_{t-3})$: 4-gram model

15 Markov Chains with Discrete Random Variables
Joint distribution in the first-order case:
$p(x_1, \ldots, x_T) = p(x_1)\, p(x_2, \ldots, x_T \mid x_1)$
$= p(x_1)\, p(x_2 \mid x_1)\, p(x_3, \ldots, x_T \mid x_1, x_2)$
$= p(x_1)\, p(x_2 \mid x_1)\, p(x_3, \ldots, x_T \mid x_2)$
$= p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)\, p(x_4, \ldots, x_T \mid x_2, x_3)$
$= p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2)\, p(x_4, \ldots, x_T \mid x_3)$
$= p(x_1) \prod_{i=2}^{T} p(x_i \mid x_{i-1})$
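The first-order factorization translates into a one-pass computation over a sequence. The following is a minimal sketch, assuming states are encoded as integer indices and that the initial distribution `pi` and transition matrix `A` are given as plain lists; the numbers are made up for illustration.

```python
import math

def sequence_log_prob(seq, pi, A):
    """ln p(x_1..x_T) = ln pi[x_1] + sum_{t=2}^{T} ln A[x_{t-1}][x_t]."""
    logp = math.log(pi[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(A[prev][cur])
    return logp

# Hypothetical two-state chain used only for illustration.
pi = [0.5, 0.5]
A = [[0.9, 0.1],
     [0.2, 0.8]]
logp = sequence_log_prob([0, 0, 1], pi, A)  # ln(0.5 * 0.9 * 0.1)
```

Working in the log domain avoids underflow for long sequences; a transition with probability zero would raise a `ValueError` in `math.log`, which this sketch does not handle.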

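16 First Order Markov Chains
$p(x_t \mid x_{t-1})$ can be represented as a $K \times K$ transition matrix $A = [a_{ij}]$, where $a_{ij}$ is the probability of going from state $i$ to state $j$. For $K = 3$, with rows indexed by $x_{t-1}$ and columns by $x_t$:
$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$
$A$ is a stochastic matrix, i.e. each row sums to one: $\sum_{k=1}^{3} a_{ik} = 1$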

17 First Order Markov Chains
Using our vector notation for discrete random variables: if $x_t$ has only its $j$-th element activated and $x_{t-1}$ has only its $i$-th element activated, then
$a_{ij} = p(x_{tj} = 1 \mid x_{t-1,i} = 1)$

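18 Transition Matrices
A stationary finite-state Markov chain is equivalent to a stochastic automaton.
Two-state example: state 1 stays put with probability $1 - \alpha$ and moves to state 2 with probability $\alpha$; state 2 stays put with probability $1 - \beta$ and moves to state 1 with probability $\beta$.
$A = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}$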

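19 Transition Matrices
Three-state left-to-right chain with self-transitions $a_{11}, a_{22}, a_{33}$ and forward transitions $a_{12}, a_{23}$:
$A = \begin{pmatrix} a_{11} & a_{12} & 0 \\ 0 & a_{22} & a_{23} \\ 0 & 0 & 1 \end{pmatrix}$
$a_{12} = 1 - a_{11}$, $a_{23} = 1 - a_{22}$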

20 Transition Matrices
The transition matrix $A$ specifies the probability of getting from $i$ to $j$ in one step. How can we compute the probability of getting from $i$ to $j$ in exactly $n$ steps?
$a_{ij}(n) = p(x_{t+n,j} = 1 \mid x_{ti} = 1)$
This is the probability of getting from $i$ to $k$ in one step and then from $k$ to $j$ in $n - 1$ steps, summed over all $k$:
$a_{ij}(n) = \sum_{k=1}^{K} p(x_{t+1,k} = 1 \mid x_{ti} = 1)\, p(x_{t+n,j} = 1 \mid x_{t+1,k} = 1) = \sum_{k=1}^{K} a_{ik}\, a_{kj}(n-1)$
$A(n) = A\, A(n-1)$, hence $A(n) = A^n$
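The n-step recursion can be checked numerically with a few lines of plain Python. This is only a sketch: the helper names `mat_mul` and `n_step` are illustrative, and the two-state matrix uses made-up values for alpha and beta.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    K = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(K)) for j in range(K)]
            for i in range(K)]

def n_step(A, n):
    """Return A(n) via the recursion A(n) = A A(n-1), with A(0) = I."""
    K = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(K)] for i in range(K)]
    for _ in range(n):
        result = mat_mul(A, result)
    return result

# Hypothetical two-state chain: alpha = 0.2, beta = 0.3.
alpha, beta = 0.2, 0.3
A = [[1 - alpha, alpha],
     [beta, 1 - beta]]
A2 = n_step(A, 2)  # two-step transition probabilities
```

Each row of `A2` still sums to one, since a product of stochastic matrices is stochastic.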

21 Stationary Distribution of the Markov Chain
Markov models are used to define joint probability distributions over sequences.
They can also be interpreted as stochastic dynamical systems, where we hop from one state to another over time.
We are interested in the long-term distribution over states, known as the stationary distribution of the chain.
Important application: Google's PageRank

22 Stationary Distribution of the Markov Chain
Assume a Markov chain with
$A = [a_{ij}] = [p(x_{tj} = 1 \mid x_{t-1,i} = 1)]$, $\pi_0 = [p(x_{0i} = 1)]$
Then
$p(x_{1j} = 1) = \sum_{k=1}^{K} p(x_{1j} = 1, x_{0k} = 1) = \sum_{k=1}^{K} p(x_{0k} = 1)\, p(x_{1j} = 1 \mid x_{0k} = 1)$
$\pi_{1j} = \sum_{k=1}^{K} \pi_{0k}\, a_{kj}$
$\pi_1^T = \pi_0^T A$

23 Stationary Distribution of the Markov Chain
We can imagine iterating these equations. If we ever reach a stage where
$\pi^T = \pi^T A$
we have reached the stationary distribution (also called the invariant distribution or equilibrium distribution).
In the case of three states the above is written:
$(\pi_1\ \pi_2\ \pi_3) = (\pi_1\ \pi_2\ \pi_3) \begin{pmatrix} 1 - a_{12} - a_{13} & a_{12} & a_{13} \\ a_{21} & 1 - a_{21} - a_{23} & a_{23} \\ a_{31} & a_{32} & 1 - a_{31} - a_{32} \end{pmatrix}$
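Iterating the update until it stops changing can be sketched as follows. This assumes the chain actually converges (which is not true of every chain); the tolerance, iteration cap, function names, and the example matrix are illustrative choices.

```python
def step(pi, A):
    """One update pi_next^T = pi^T A for a row-stochastic A."""
    K = len(A)
    return [sum(pi[k] * A[k][j] for k in range(K)) for j in range(K)]

def stationary_by_iteration(A, pi0, tol=1e-12, max_iter=10000):
    """Repeat pi^T <- pi^T A until the change falls below tol."""
    pi = list(pi0)
    for _ in range(max_iter):
        nxt = step(pi, A)
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence; the chain may be periodic")

# Hypothetical two-state chain with alpha = 0.2, beta = 0.3.
A = [[0.8, 0.2],
     [0.3, 0.7]]
pi = stationary_by_iteration(A, [1.0, 0.0])  # approximately (0.6, 0.4)
```

For a two-state chain the fixed point is $(\beta, \alpha) / (\alpha + \beta)$, which matches the value the iteration settles on here.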

24 Stationary Distribution of the Markov Chain
so
$\pi_1 = \pi_1 (1 - a_{12} - a_{13}) + \pi_2 a_{21} + \pi_3 a_{31}$
or
$\pi_1 (a_{12} + a_{13}) = \pi_2 a_{21} + \pi_3 a_{31}$
similarly
$\pi_2 (a_{21} + a_{23}) = \pi_1 a_{12} + \pi_3 a_{32}$
and
$\pi_3 (a_{31} + a_{32}) = \pi_1 a_{13} + \pi_2 a_{23}$
In general, we have
$\pi_i \sum_{j \neq i} a_{ij} = \sum_{j \neq i} \pi_j a_{ji}$ and $\sum_j \pi_j = 1$
The probability of being in state $i$ times the net flow out of state $i$ must equal the sum, over every other state $j$, of the probability of being in $j$ times the flow from $j$ into $i$.

25 Stationary Distribution of the Markov Chain
$A^T \pi = \pi$ looks like an eigen-analysis problem, i.e. $\pi$ is an eigenvector of $A^T$ with eigenvalue 1.
Such an eigenvector always exists: since $A$ is row-stochastic, $A \mathbf{1} = \mathbf{1}$, and $A$ and $A^T$ have the same eigenvalues.
But the eigenvectors are guaranteed to be real-valued only when $a_{ij} > 0$. What happens in the case that $a_{ij} = 0$?

26 Stationary Distribution of the Markov Chain
$\pi^T (I - A) = 0^T$ ($K$ constraints)
$\pi^T \mathbf{1} = 1$ (1 extra constraint)
The problem is over-constrained. Define the matrix $M = I - A$ and replace one column with 1s:
$(\pi_1\ \pi_2\ \pi_3) \begin{pmatrix} 1 - a_{11} & -a_{12} & 1 \\ -a_{21} & 1 - a_{22} & 1 \\ -a_{31} & -a_{32} & 1 \end{pmatrix} = (0\ 0\ 1)$

27 Stationary Distribution of the Markov Chain
Solving $(\pi_1\ \pi_2\ \pi_3)\, M = (0\ 0\ 1)$ for the example chain gives
$(\pi_1\ \pi_2\ \pi_3) = (0.4\ 0.4\ 0.2)$
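The column-replacement trick amounts to solving one small linear system. The sketch below does this with a hand-written Gaussian elimination so that it needs no external library; the helper names and both example matrices are assumptions made for illustration (they are not the chain from the slides), and the solver will fail with a division by zero if the system is singular, as it is for a chain without a unique stationary distribution.

```python
def solve(B, r):
    """Solve B z = r by Gaussian elimination with partial pivoting."""
    n = len(B)
    aug = [list(B[i]) + [r[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

def stationary_by_solve(A):
    """Solve pi^T M = (0,..,0,1), where M = I - A with last column set to 1."""
    K = len(A)
    M = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(K)]
         for i in range(K)]
    for i in range(K):
        M[i][K - 1] = 1.0
    # pi^T M = r^T is equivalent to M^T pi = r.
    Mt = [[M[i][j] for i in range(K)] for j in range(K)]
    r = [0.0] * (K - 1) + [1.0]
    return solve(Mt, r)

# Hypothetical chains used only to exercise the code.
pi2 = stationary_by_solve([[0.8, 0.2], [0.3, 0.7]])
pi3 = stationary_by_solve([[0.5, 0.5, 0.0],
                           [0.25, 0.5, 0.25],
                           [0.0, 0.5, 0.5]])
```

Replacing a column works because the K balance equations are linearly dependent, so one of them can be traded for the normalization constraint without losing information.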

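28 Stationary Distribution of the Markov Chain
When does a stationary distribution exist?
In the four-state example, state 4 is an absorbing state, hence $\pi = (0, 0, 0, 1)$ is a possible stationary distribution; so is $\pi = (0.5, 0.5, 0, 0)$. The stationary distribution is therefore not unique.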

29 Stationary Distribution of the Markov Chain
The first necessary condition for a unique stationary distribution is that the state transition diagram be a singly connected component. Such chains are called irreducible (i.e., you can go from any state to any other state).
Irreducibility alone does not guarantee that the chain converges. Take the two-state chain with $\alpha = \beta = 1$ and start from state 2: at $t = 2b + 1$ the chain is in state 1 and at $t = 2b$ it is in state 2, so it oscillates.

30 Stationary Distribution of the Markov Chain
Period of state $i$: $d_i = \gcd\{t : a_{ii}(t) > 0\}$
$d_1 = \gcd\{2, 3, 4, 6, \ldots\} = 1$
$d_2 = \gcd\{2, 3, 4, 6, \ldots\} = 1$
$d_3 = \gcd\{3, 5, 6, \ldots\} = 1$
State $i$ is aperiodic if $d_i = 1$. A Markov chain is aperiodic if $d_i = 1$ for all $i$.
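The period of a state can be estimated by tracking which return times are possible. The sketch below only inspects return times up to a fixed `horizon`, so it is an approximation of the definition rather than an exact algorithm; the function name and example chains are illustrative.

```python
from math import gcd

def period(A, i, horizon=50):
    """gcd of all t <= horizon with a_ii(t) > 0 (0 if no return is seen)."""
    K = len(A)
    # reach[j] is True if state j can be reached from i in exactly t steps.
    reach = [j == i for j in range(K)]
    d = 0
    for t in range(1, horizon + 1):
        reach = [any(reach[k] and A[k][j] > 0 for k in range(K))
                 for j in range(K)]
        if reach[i]:
            d = gcd(d, t)  # gcd(0, t) == t, so the first return sets d
    return d

oscillating = [[0.0, 1.0],
               [1.0, 0.0]]   # alpha = beta = 1
lazy = [[0.8, 0.2],
        [0.3, 0.7]]          # self-loops make every state aperiodic

d_osc = period(oscillating, 0)
d_lazy = period(lazy, 0)
```

Only the zero pattern of `A` matters here, which is why the code tracks reachability with booleans instead of multiplying probabilities.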

31 Stationary Distribution of the Markov Chain
Every irreducible (singly connected), aperiodic finite-state Markov chain has a limiting distribution, which is equal to $\pi$, its unique stationary distribution.
Special cases and sufficient conditions: every regular finite-state chain has a unique stationary distribution, where regular means that $a_{ij}(t) > 0$ for all $i, j$ for some $t$.

32 Stationary Distribution of the Markov Chain
Small web (each page moves with uniform probability to the pages it is connected to).
[Figure: web graph over pages $x_1, \ldots, x_6$]
First step: make it regular.
$(\pi_1\ \pi_2\ \pi_3\ \pi_4\ \pi_5\ \pi_6) = (\ )$

33 Markov Chain for Language Modelling
One important application of Markov models is to build statistical language models (i.e., probability distributions over sequences of words).
Sentence completion. Predict the next word based on the previous one.
Data compression. Any density model can be used to define an encoding scheme, by assigning short code-words to more probable strings.
Text classification. Any density model can be used as a class-conditional density.
Automatic essay writing. Sample from $p(x_1, \ldots, x_T)$.
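Sampling from $p(x_1, \ldots, x_T)$ only requires drawing the first symbol from the initial distribution and each later symbol from the row of the transition matrix selected by its predecessor. The sketch below works over characters rather than words; the function name, state set, and probabilities are made up for illustration.

```python
import random

def sample_chain(pi, A, states, T, seed=0):
    """Draw x_1 ~ pi, then x_t ~ A[x_{t-1}] for t = 2..T."""
    rng = random.Random(seed)
    idx = rng.choices(range(len(states)), weights=pi)[0]
    out = [states[idx]]
    for _ in range(T - 1):
        idx = rng.choices(range(len(states)), weights=A[idx])[0]
        out.append(states[idx])
    return "".join(out)

# Deterministic cycle a -> b -> c -> a, handy for checking the mechanics.
cycle = sample_chain([1.0, 0.0, 0.0],
                     [[0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0]],
                     "abc", T=6)

# A hypothetical noisy chain over the same alphabet.
text = sample_chain([0.4, 0.4, 0.2],
                    [[0.1, 0.8, 0.1],
                     [0.2, 0.6, 0.2],
                     [0.3, 0.6, 0.1]],
                    "abc", T=16, seed=1)
```

For a word-level language model the same loop applies, with `states` replaced by a vocabulary list and the join done with spaces.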

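34 Simple Parameter Estimation
Training sequences, with probabilities $p(x_1^1, \ldots, x_T^1)$, $p(x_1^2, \ldots, x_T^2)$, ..., $p(x_1^N, \ldots, x_T^N)$:
abbbcbbabcbbbabc
bbcabbabbcbbbaba
abccabbabbcbbbab
Each symbol is coded as a one-hot vector: $a = (1, 0, 0)^T$, $b = (0, 1, 0)^T$, $c = (0, 0, 1)^T$
$p(x_1 \mid \pi) = \prod_{k=1}^{3} \pi_k^{x_{1k}}$
$p(x_t \mid x_{t-1}) = \prod_{j=1}^{3} \prod_{k=1}^{3} a_{jk}^{x_{t-1,j}\, x_{tk}}$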

35 Maximum Likelihood for Markov Chains
What are the parameters in this case? $\theta = \{\pi, A\}$
The problem is now formulated as: given a set of observations $D_l = \{x_1^l, \ldots, x_T^l\}$, $l = 1, \ldots, N$, find the parameters $\theta$ that maximize $p(D_1, \ldots, D_N \mid \theta)$:
$p(D_1, \ldots, D_N \mid \theta) = \prod_{l=1}^{N} p(D_l \mid \theta)$

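36 Maximum Likelihood for Markov Chains
$p(D_l \mid \theta) = p(x_1^l, \ldots, x_T^l \mid \theta) = p(x_1^l) \prod_{t=2}^{T} p(x_t^l \mid x_{t-1}^l)$
$= \prod_{k=1}^{3} \pi_k^{x_{1k}^l} \prod_{t=2}^{T} \prod_{j=1}^{3} \prod_{k=1}^{3} a_{jk}^{x_{t-1,j}^l\, x_{tk}^l}$
$p(D_1, \ldots, D_N \mid \theta) = \prod_{l=1}^{N} \prod_{k=1}^{3} \pi_k^{x_{1k}^l} \prod_{l=1}^{N} \prod_{t=2}^{T} \prod_{j=1}^{3} \prod_{k=1}^{3} a_{jk}^{x_{t-1,j}^l\, x_{tk}^l}$
$\ln p(D_1, \ldots, D_N \mid \theta) = \sum_{l=1}^{N} \sum_{k=1}^{3} x_{1k}^l \ln \pi_k + \sum_{l=1}^{N} \sum_{t=2}^{T} \sum_{j=1}^{3} \sum_{k=1}^{3} x_{t-1,j}^l\, x_{tk}^l \ln a_{jk}$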

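37 Maximum Likelihood for Markov Chains
$\ln p(D_1, \ldots, D_N \mid \theta) = \sum_{l=1}^{N} \sum_{k=1}^{3} x_{1k}^l \ln \pi_k + \sum_{l=1}^{N} \sum_{t=2}^{T} \sum_{j=1}^{3} \sum_{k=1}^{3} x_{t-1,j}^l\, x_{tk}^l \ln a_{jk}$
$= \sum_{k=1}^{3} \left( \sum_{l=1}^{N} x_{1k}^l \right) \ln \pi_k + \sum_{j=1}^{3} \sum_{k=1}^{3} \left( \sum_{l=1}^{N} \sum_{t=2}^{T} x_{t-1,j}^l\, x_{tk}^l \right) \ln a_{jk}$
Let us define the counts
$N_k^1 = \sum_{l=1}^{N} x_{1k}^l$, $N_{jk} = \sum_{l=1}^{N} \sum_{t=2}^{T} x_{t-1,j}^l\, x_{tk}^l$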

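38 Maximum Likelihood for Markov Chains
$\ln p(D_1, \ldots, D_N \mid \theta) = \sum_{k=1}^{3} N_k^1 \ln \pi_k + \sum_{j=1}^{3} \sum_{k=1}^{3} N_{jk} \ln a_{jk}$
Solve the above subject to: $\sum_{k=1}^{3} \pi_k = 1$ and $\sum_{k=1}^{3} a_{jk} = 1$ for each $j$.
The Lagrangian is:
$L(\pi, A) = \sum_{k=1}^{3} N_k^1 \ln \pi_k + \sum_{j=1}^{3} \sum_{k=1}^{3} N_{jk} \ln a_{jk} - \lambda \left( \sum_{k=1}^{3} \pi_k - 1 \right) - \sum_{j=1}^{3} \gamma_j \left( \sum_{k=1}^{3} a_{jk} - 1 \right)$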
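The step from the Lagrangian to its stationary point can be spelled out as follows. This is a sketch that assumes one multiplier $\gamma_j$ per row constraint on $A$ and all counts strictly positive.

```latex
\begin{align*}
\frac{\partial L}{\partial \pi_k} &= \frac{N_k^1}{\pi_k} - \lambda = 0
  \;\Rightarrow\; \pi_k = \frac{N_k^1}{\lambda},
  &\sum_{k=1}^{3} \pi_k = 1 &\;\Rightarrow\; \lambda = \sum_{k=1}^{3} N_k^1 \\
\frac{\partial L}{\partial a_{jk}} &= \frac{N_{jk}}{a_{jk}} - \gamma_j = 0
  \;\Rightarrow\; a_{jk} = \frac{N_{jk}}{\gamma_j},
  &\sum_{k=1}^{3} a_{jk} = 1 &\;\Rightarrow\; \gamma_j = \sum_{k=1}^{3} N_{jk}
\end{align*}
```

Because the log-likelihood is concave in $\pi$ and in each row of $A$, this stationary point is the maximum.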

39 Maximum Likelihood for Markov Chains
which gives us:
$\pi_k = \dfrac{N_k^1}{\sum_{k'=1}^{3} N_{k'}^1}$, $a_{jk} = \dfrac{N_{jk}}{\sum_{k'=1}^{3} N_{jk'}}$
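These estimates are just normalized counts, so they can be computed in a few lines. The sketch below applies them to the three example strings from the parameter-estimation slide; the function name `fit_markov_chain` and the choice to leave rows without observed transitions as zeros are assumptions of this illustration.

```python
from collections import Counter

def fit_markov_chain(sequences, states):
    """Maximum-likelihood pi and A from counts of first symbols and bigrams."""
    first = Counter(seq[0] for seq in sequences)   # N_k^1
    trans = Counter()                              # N_jk
    for seq in sequences:
        trans.update(zip(seq, seq[1:]))
    n_first = sum(first.values())
    pi = {k: first[k] / n_first for k in states}
    A = {}
    for j in states:
        row_total = sum(trans[(j, k)] for k in states)
        # Assumption: a state with no observed outgoing transition keeps zeros.
        A[j] = {k: (trans[(j, k)] / row_total if row_total else 0.0)
                for k in states}
    return pi, A

sequences = ["abbbcbbabcbbbabc",
             "bbcabbabbcbbbaba",
             "abccabbabbcbbbab"]
pi, A = fit_markov_chain(sequences, "abc")
```

Two of the three strings start with `a` and one with `b`, so the estimate assigns zero initial probability to `c`; in practice such zeros are often smoothed, which this sketch does not attempt.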
