Advanced Data Science
Dr. Kira Radinsky
Slides adapted from Tom M. Mitchell
Agenda
Topics covered:
- Time series data
- Markov models
- Hidden Markov models
- Dynamic Bayes nets
Additional reading: Bishop, Chapter 13
Sequential Data
Examples: stock market prediction, speech recognition, gene data analysis
Given a sequence of observations O_1, O_2, ..., O_T, how shall we represent and learn P(O_1, O_2, ..., O_T)?
Markov Model
How shall we represent and learn P(O_1, O_2, ..., O_T)?
Use a Bayes net:
- Markov model: P(O_1, ..., O_T) = P(O_1) ∏_{t=2..T} P(O_t | O_{t-1})
- nth-order Markov model: each O_t depends on the n previous observations, P(O_t | O_{t-1}, ..., O_{t-n})
Nth-Order Markov Model
- The Markov property specifies that the probability of a state depends only on the previous state.
- We can build more memory into our states by using a higher-order Markov model.
- Higher-order models remember more history and can therefore have higher predictive value.
[Figure: chains over O_{T-3}, O_{T-2}, O_{T-1}, O_T illustrating 1st-, 2nd-, and 3rd-order dependencies]
Markov Model (continued)
- Markov model: P(O_1, ..., O_T) = P(O_1) ∏_{t=2..T} P(O_t | O_{t-1})
- nth-order Markov model: P(O_t | O_{t-1}, ..., O_{t-n})
- If O_t is real-valued and we assume P(O_t) ~ N(f(o_{t-1}, o_{t-2}, ..., o_{t-n}), σ), where f is a linear function, this is called an nth-order autoregressive (AR) model. A minimal fitting sketch follows below.
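To make the autoregressive case concrete, here is a minimal sketch (not from the original slides) that fits an nth-order AR model by ordinary least squares; the function name fit_ar and the synthetic series are illustrative assumptions.

```python
import numpy as np

def fit_ar(series, n):
    """Fit an nth-order AR model o_t ~ N(f(o_{t-1},...,o_{t-n}), sigma),
    with f linear, by ordinary least squares."""
    # Design matrix: each row holds the n previous observations plus an intercept.
    X = np.column_stack([series[n - i - 1 : len(series) - i - 1] for i in range(n)])
    X = np.column_stack([X, np.ones(len(series) - n)])
    y = series[n:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma = (y - X @ coeffs).std()          # noise scale from the residuals
    return coeffs, sigma

# Illustrative use on a synthetic AR(2) series.
rng = np.random.default_rng(0)
series = np.zeros(500)
for t in range(2, 500):
    series[t] = 0.6 * series[t - 1] - 0.3 * series[t - 2] + rng.normal(scale=0.1)
coeffs, sigma = fit_ar(series, n=2)
print(coeffs, sigma)   # coefficients should be close to [0.6, -0.3, 0.0]
```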
Hidden Markov Models: Example
An experience in a casino. Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (sometimes with a fair die, sometimes with a loaded die)
4. Highest number wins $2
Here is his sequence of die rolls:
1245526462146146136136661664661636616366163616515615115146123562344
Which die is being used in each play?
Question
1245526462146146136136661664661636616366163616515615115146123562344
Which die is being used in each play?
- Unobserved state: S_{t-2}, S_{t-1}, S_t, S_{t+1}, S_{t+2}, with S_t ∈ {fair, loaded}
- Observed output: O_{t-2}, O_{t-1}, O_t, O_{t+1}, O_{t+2}, with O_t ∈ {1, 2, 3, 4, 5, 6}
The Dishonest Casino Model
Transitions: P(FAIR → FAIR) = 0.95, P(FAIR → LOADED) = 0.05, P(LOADED → LOADED) = 0.95, P(LOADED → FAIR) = 0.05
Emissions:
- FAIR: P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
- LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2 (the loaded die favors rolling a 6)
Puzzles Regarding the Dishonest Casino
GIVEN: a sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTIONS:
- How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem.
- What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING problem.
- How loaded is the loaded die? How fair is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING problem.
Definition of HMM
1. N: the number of hidden states X_t can take on, X_t ∈ {1, 2, ..., N}
2. O: the set of values O_t can take on
3. Initial state distribution: P(X_1 = k), for k = 1, 2, ..., N
4. State transition distribution: P(X_{t+1} = k | X_t = i), for k, i = 1, 2, ..., N
5. Emission distribution: P(O_t | X_t)
[Figure: graphical model S_{t-2} → S_{t-1} → S_t with emissions O_{t-2}, O_{t-1}, O_t, and the equivalent state-automaton view with transition probabilities P(S_t | S_{t-1})]
A concrete sketch of this definition for the casino model appears below.
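As a concrete instance of this definition, the following sketch (my own illustration, not course code) encodes the dishonest casino HMM from the earlier slide as the three distributions and samples a roll sequence from it. The uniform initial distribution over fair/loaded is an assumption, since the slides do not specify it.

```python
import numpy as np

# States: 0 = FAIR, 1 = LOADED.  Observations: die faces 1..6, stored as 0..5.
pi = np.array([0.5, 0.5])            # initial state distribution P(X_1 = k) (assumed uniform)
A = np.array([[0.95, 0.05],          # state transition distribution P(X_{t+1} = k | X_t = i)
              [0.05, 0.95]])
B = np.array([[1/6] * 6,             # emission distribution P(O_t | X_t): fair die
              [1/10] * 5 + [1/2]])   # loaded die favors rolling a 6

def sample_hmm(pi, A, B, T, seed=0):
    """Draw a length-T sequence of hidden states and observations from the HMM."""
    rng = np.random.default_rng(seed)
    states, obs = [], []
    state = rng.choice(len(pi), p=pi)
    for _ in range(T):
        states.append(state)
        obs.append(rng.choice(B.shape[1], p=B[state]))
        state = rng.choice(A.shape[0], p=A[state])
    return np.array(states), np.array(obs)

states, rolls = sample_hmm(pi, A, B, T=50)
print("rolls:", rolls + 1)    # die faces 1..6
print("dice :", states)       # 0 = fair, 1 = loaded
```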
Handwriting Recognition
Character recognition, e.g., logistic regression, Naïve Bayes
Handwriting Recognition as an HMM
Hidden states S_{t-2}, S_{t-1}, S_t ∈ {a, b, ..., z}; observations O_t are images of the written characters
Character recognition, e.g., logistic regression, Naïve Bayes
Example of a hidden Markov model (HMM)
Understanding HMM Semantics
Hidden states X_1, X_2, X_3, X_4, X_5 ∈ {a, ..., z}; observations O_1, ..., O_5 are vectors of pixel intensities
Need to evaluate only 3 distributions, with parameters shared across time steps:
- Initial state distribution: P(X_1 = i)
- Transition distribution: θ_ab = P(X_{t+1} = b | X_t = a), a 26x26 table of parameters
- Emission distribution: P(O_t | X_t), e.g., N(μ_{X_t}, σ)
Using and Learning HMMs
Core HMM questions:
1. How do we calculate P(o_1, o_2, ..., o_n)?
2. How do we calculate argmax over x_1, x_2, ..., x_n of P(x_1, x_2, ..., x_n | o_1, o_2, ..., o_n)?
3. How do we train the HMM, given its structure and
   3a. fully observed training examples <x_1, ..., x_n, o_1, ..., o_n>?
   3b. partially observed training examples <o_1, ..., o_n>?
How do we compute P(o_1, ..., o_T)?
1. Brute force: sum the joint over every possible hidden state sequence,
   P(o_1, o_2, ..., o_T) = Σ_i Σ_j ... Σ_k P(x_1 = i, x_2 = j, ..., x_T = k, o_1, ..., o_T)
   This requires summing over N^T state sequences, so it is exponential in the sequence length. A small sketch follows below.
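A small sketch of this brute-force sum, reusing the pi, A, B arrays defined in the casino sketch above; it is only feasible for very short sequences because it enumerates every state sequence.

```python
import itertools
import numpy as np

def brute_force_likelihood(pi, A, B, obs):
    """P(o_1,...,o_T) by summing the joint over every hidden state sequence."""
    N, T = len(pi), len(obs)
    total = 0.0
    for xs in itertools.product(range(N), repeat=T):   # all N**T state sequences
        p = pi[xs[0]] * B[xs[0], obs[0]]
        for t in range(1, T):
            p *= A[xs[t - 1], xs[t]] * B[xs[t], obs[t]]
        total += p
    return total

# Feasible only for short sequences, e.g. T = 8 rolls (faces stored as 0..5).
obs = np.array([0, 1, 3, 4, 4, 1, 5, 5])
print(brute_force_likelihood(pi, A, B, obs))
```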
How do we compute P(o_1, ..., o_T)?
2. Forward algorithm (dynamic programming, variable elimination): define
   α_t(k) = P(O_1 = o_1, ..., O_t = o_t, X_t = k)
   Initialization: α_1(k) = P(O_1 = o_1, X_1 = k) = P(X_1 = k) P(O_1 = o_1 | X_1 = k)
   Recursion: α_{t+1}(k) = P(O_{t+1} = o_{t+1} | X_{t+1} = k) Σ_j α_t(j) P(X_{t+1} = k | X_t = j)
   For example: α_2(k) = P(O_1 = o_1, O_2 = o_2, X_2 = k) = P(O_2 = o_2 | X_2 = k) Σ_j α_1(j) P(X_2 = k | X_1 = j)
   Termination: P(o_1, ..., o_T) = Σ_k α_T(k)
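A minimal sketch of the forward recursion (variable and function names are mine), continuing the same casino example; on the short obs sequence above it should reproduce the brute-force likelihood.

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, k] stores the forward probability for time t+1 (0-indexed in code)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)  # sum over previous states
    return alpha, alpha[-1].sum()                     # P(o_1,...,o_T) = sum_k alpha_T(k)

alpha, likelihood = forward(pi, A, B, obs)
print(likelihood)   # should equal the brute-force result above
```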
How do we compute P(o_1, ..., o_T)?
3. Backward algorithm (dynamic programming, variable elimination): define
   β_t(k) = P(O_{t+1} = o_{t+1}, ..., O_T = o_T | X_t = k)
   Initialization: β_T(k) = 1
   Recursion: β_t(k) = Σ_j P(X_{t+1} = j | X_t = k) P(O_{t+1} = o_{t+1} | X_{t+1} = j) β_{t+1}(j)
   Combining with the forward pass: P(o_1, ..., o_T) = Σ_k α_t(k) β_t(k) for any t, in particular Σ_k α_T(k).
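The matching backward recursion, a sketch under the same assumptions and naming as the forward sketch; combining it with the forward table gives the same sequence likelihood at any time step.

```python
import numpy as np

def backward(pi, A, B, obs):
    """beta[t, k] stores the backward probability for time t+1 (0-indexed in code)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                              # beta_T(k) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # sum over next states
    return beta

beta = backward(pi, A, B, obs)
print((alpha[0] * beta[0]).sum())   # same likelihood as the forward pass
```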
Russell & Norvig 2010 Chapter 15 pp. 566 Infer the weather given observation of a man either carrying or not carrying an umbrella 2 states of weather: state 1 = rain, state 2 = no rain 70% chance of staying the same each day 30% chance of changing Each state generates 2 events: event 1 = umbrella, event 2 = no umbrella conditional probabilities for events occurring in each state We then observe the following sequence of events: {umbrella, umbrella, no umbrella, umbrella, umbrella} umbrella umbrella No umbrella umbrella umbrella 20
Computing Forward Probabilities
We don't know which state the weather is in before our observations, so we start from a uniform distribution over the two states.
Computing Backward Probabilities
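Putting the two passes together on the umbrella example, here is a sketch of smoothing P(rain on day t | all five observations), reusing the forward and backward functions defined above. The transition probabilities and the uniform prior come from the slides; the emission probabilities P(umbrella | rain) = 0.9 and P(umbrella | no rain) = 0.2 are the values used in Russell & Norvig's example and are not shown on the slide, so treat them as an assumption here.

```python
import numpy as np

# States: 0 = rain, 1 = no rain.  Observations: 0 = umbrella, 1 = no umbrella.
pi_w = np.array([0.5, 0.5])                 # uniform prior over the initial weather
A_w = np.array([[0.7, 0.3],
                [0.3, 0.7]])
B_w = np.array([[0.9, 0.1],                 # assumed emission probs (Russell & Norvig's values)
                [0.2, 0.8]])
obs_w = [0, 0, 1, 0, 0]                     # umbrella, umbrella, no umbrella, umbrella, umbrella

alpha_w, _ = forward(pi_w, A_w, B_w, obs_w)
beta_w = backward(pi_w, A_w, B_w, obs_w)
gamma = alpha_w * beta_w
gamma /= gamma.sum(axis=1, keepdims=True)   # P(X_t = k | o_1,...,o_T)
print(gamma[:, 0])                          # smoothed probability of rain on each day
```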
How do we compute argmax over x_1, ..., x_T of P(x_1, ..., x_T | o_1, ..., o_T)?
Viterbi algorithm, based on recursive computation of
   δ_t(k) = max over x_1, ..., x_{t-1} of P(x_1, ..., x_{t-1}, X_t = k, o_1, ..., o_t)
with back-pointers to recover the maximizing state sequence.
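A sketch of the Viterbi recursion for the decoding question (my implementation, same pi/A/B/obs conventions as the casino sketches above): delta_t(k) tracks the probability of the best state sequence ending in state k at time t, and back-pointers recover the argmax path.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence argmax_x P(x_1..x_T | o_1..o_T)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # best-path probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[j, k]: come from j, move to k
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Trace the best path backwards from the best final state.
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]

print(viterbi(pi, A, B, obs))   # which die (0 = fair, 1 = loaded) at each roll
```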
Learning HMMs from Fully Observed Data: Easy
Just 3 distributions to estimate, each by counting:
- Initial state distribution P(X_1 = k)
- State transition distribution P(X_{t+1} = k | X_t = i)
- Emission distribution P(O_t | X_t)
Hope I convinced you.
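A minimal sketch of this counting estimator (illustrative names; plain maximum likelihood with no smoothing), applied to the states and rolls sampled in the casino sketch above.

```python
import numpy as np

def mle_hmm(state_seqs, obs_seqs, N, M):
    """Estimate pi, A, B from fully observed (state, observation) sequences by counting."""
    pi_c, A_c, B_c = np.zeros(N), np.zeros((N, N)), np.zeros((N, M))
    for xs, os in zip(state_seqs, obs_seqs):
        pi_c[xs[0]] += 1
        for t in range(len(xs)):
            B_c[xs[t], os[t]] += 1
            if t + 1 < len(xs):
                A_c[xs[t], xs[t + 1]] += 1
    # Normalize counts into probability distributions.
    return (pi_c / pi_c.sum(),
            A_c / A_c.sum(axis=1, keepdims=True),
            B_c / B_c.sum(axis=1, keepdims=True))

# Illustrative use: estimate the casino model back from the sampled data.
pi_hat, A_hat, B_hat = mle_hmm([states], [rolls], N=2, M=6)
print(A_hat, B_hat, sep="\n")
```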
Learning HMMs When We Only Observe o_1, ..., o_T
EM Algorithm (the Baum-Welch algorithm):
1. E step: estimate P(X_1, ..., X_T | o_1, ..., o_T) using forward-backward
2. M step: choose θ to maximize E_{P(X_1,...,X_T | o_1,...,o_T)} [ log P(X_1, ..., X_T, o_1, ..., o_T | θ) ]
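A compact sketch of one Baum-Welch iteration for a single observation sequence (my own illustration, reusing the forward and backward functions defined above): the E step turns alpha and beta into state posteriors gamma and transition posteriors xi, and the M step re-normalizes their expected counts into new estimates of the three distributions.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM iteration of Baum-Welch on a single observation sequence."""
    T, N = len(obs), len(pi)
    obs = np.asarray(obs)
    alpha, likelihood = forward(pi, A, B, obs)
    beta = backward(pi, A, B, obs)
    # E step: posteriors over states and over transitions.
    gamma = alpha * beta / likelihood                     # gamma[t, k] = P(X_t = k | o)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood   # xi[t, i, j]
    # M step: re-estimate the three distributions from expected counts.
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for v in range(B.shape[1]):
        B_new[:, v] = gamma[obs == v].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

# Illustrative use: one re-estimation step on the sampled casino rolls.
pi1, A1, B1 = baum_welch_step(pi, A, B, rolls)
```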
Additional Time Series Models
Factorial and Coupled HMMs
- Factorial HMM: several hidden state chains evolve in parallel; the observation at time t can depend on all state variables at that time, and the hidden chains S_i at time t are independent of one another.
- Coupled variant: here we couple the hidden chains to each other and add an observable input X_i.
[Figure: graphical models of the factorial HMM and the coupled variant, with hidden and observed nodes marked]
[Figure: a time series model with real-valued hidden variables, a discrete hidden variable S_i taking M values, and observed outputs]
What You Need to Know
Hidden Markov Models (HMMs):
- Very useful, very powerful! Speech, OCR, time series, ...
- Parameter sharing: only 3 distributions to learn
- Dynamic programming (variable elimination) reduces inference complexity
- Special case of a Bayes net
Dynamic Bayesian Networks