Hidden Markov Models (HMM) and Support Vector Machine (SVM)
Professor Joongheon Kim
School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea
Part 1: Hidden Markov Models
Outline
- Hidden Markov Models
  - Markov
    - Markov Chain
    - Markov Models and Markov Processes
  - Hidden Markov Model (HMM)
  - HMM Applications: Probability Evaluation
Markov (Markov Chain)
[Definition ($P_{ij}$)] The fixed probability (one-step transition probability) that the process will next be in state $j$ whenever it is in state $i$. That is,
$P_{ij} = P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1, X_0 = i_0)$
for all states $i_0, i_1, \ldots, i_{n-1}, i, j$ and all $n \ge 0$.
[Note (Markov Property)] For all states $i_0, i_1, \ldots, i_{n-1}, i, j$ and all $n \ge 0$,
$P_{ij} = P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$
Markov (Markov Chain)
[Note] $P_{ij} \ge 0$ for all $i \ge 0$, $j \ge 0$, and $\sum_{j=0}^{\infty} P_{ij} = 1$ for all $i = 0, 1, \ldots$
[Markov Chain diagram: from state $i$, outgoing transitions $P_{i1}, P_{i2}, \ldots, P_{in}$ to states $1, 2, \ldots, n$, plus the self-loop $P_{ii}$]
Markov (Markov Chain)
[Note ($P$)] Let $P$ denote the matrix of one-step transition probabilities:
$P = \begin{pmatrix} P_{ii} & P_{ij} & P_{ik} \\ P_{ji} & P_{jj} & P_{jk} \\ P_{ki} & P_{kj} & P_{kk} \end{pmatrix}$
[Diagram: three-state transition diagram over states $i$, $j$, $k$ with all pairwise transitions and self-loops]
Markov (Markov Chain)
[Example] There are two milk companies in South Korea, A and B. Based on last year's statistics, 88% of A's customers are still with A, and the other 12% are now with B. Likewise, 85% of B's customers are still with B, and the other 15% are now with A.
[Transition Matrix] $P = \begin{pmatrix} P_{AA} & P_{AB} \\ P_{BA} & P_{BB} \end{pmatrix} = \begin{pmatrix} 0.88 & 0.12 \\ 0.15 & 0.85 \end{pmatrix}$
[Markov Chain diagram: states A and B with $P_{AA} = 0.88$, $P_{AB} = 0.12$, $P_{BA} = 0.15$, $P_{BB} = 0.85$]
[One-Step Transition] If the initial market share is A = 0.25 and B = 0.75, i.e., $s_0 = (0.25 \;\; 0.75)$, the next market share is:
$s_1 = s_0 P = (0.25 \;\; 0.75)\begin{pmatrix} 0.88 & 0.12 \\ 0.15 & 0.85 \end{pmatrix} = (0.3325 \;\; 0.6675)$
Markov (Markov Chain)
[Example (Multi-Step Transition)] From $P$ (on the previous slide), suppose we are in state $i$ at time $t$ and want the probability of being in state $i$ at time $t + 2$ (denoted $P_{ii}^{(2)}$). Summing over the intermediate state at time $t + 1$:
$P_{ii}^{(2)} = P(X_{n+2} = i \mid X_n = i) = P_{ii}P_{ii} + P_{ij}P_{ji} + P_{ik}P_{ki}$,
which is exactly the $(i, i)$ entry of
$P^2 = \begin{pmatrix} P_{ii} & P_{ij} & P_{ik} \\ P_{ji} & P_{jj} & P_{jk} \\ P_{ki} & P_{kj} & P_{kk} \end{pmatrix}^2$
(where $X_n = i$ means the state at time $n$ is $i$).
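As a quick numerical sanity check, here is a minimal NumPy sketch (variable names are ours, not from the slides) that reproduces the one-step market-share update from the milk-company example and reads the two-step probability $P_{AA}^{(2)}$ off $P^2$:

```python
import numpy as np

# Transition matrix from the milk-company example; rows/columns ordered (A, B)
P = np.array([[0.88, 0.12],
              [0.15, 0.85]])

s0 = np.array([0.25, 0.75])            # initial market share
s1 = s0 @ P                            # one-step transition
print(s1)                              # [0.3325 0.6675]

# Two-step transition probabilities are the entries of P squared
P2 = np.linalg.matrix_power(P, 2)
print(P2[0, 0])                        # P_AA^(2) = 0.88*0.88 + 0.12*0.15 = 0.7924
```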
Markov (Markov Models and Markov Processes)
Example for Markov Model (Weather Forecasting)
Weather states: Sunny (S), Rainy (R), Foggy (F).
Today's weather $q_n$ depends on the previous weather conditions $q_{n-1}, q_{n-2}, \ldots, q_1$: $P(q_n \mid q_{n-1}, q_{n-2}, \ldots, q_1)$.
Example: if the previous three weather conditions are $q_{n-1} = S$, $q_{n-2} = R$, and $q_{n-3} = F$, then the probability that today's weather ($q_n$) is R is:
$P(q_n = R \mid q_{n-1} = S, q_{n-2} = R, q_{n-3} = F)$
Markov (Markov Models and Markov Processes)
Observation from the previous [Example]: the larger $n$ is, the more information we have to gather. If $n = 6$, we need $3^{(6-1)} = 243$ weather histories. Therefore, we need an assumption (called the Markov Assumption) that reduces the amount of data to gather.
[First-Order Markov Assumption] $P(q_n = S_j \mid q_{n-1} = S_i, q_{n-2} = S_k, \ldots) = P(q_n = S_j \mid q_{n-1} = S_i)$
[Second-Order Markov Assumption] $P(q_n = S_j \mid q_{n-1} = S_i, q_{n-2} = S_k, \ldots) = P(q_n = S_j \mid q_{n-1} = S_i, q_{n-2} = S_k)$
Markov (Markov Models and Markov Processes)
Observation from the previous [Example] (continued): with the (first-order) Markov assumption, the probability of observing a sequence $q_1, q_2, \ldots, q_n$ can be presented as a joint probability:
$P(q_1, q_2, \ldots, q_n) = P(q_1)\, P(q_2 \mid q_1)\, P(q_3 \mid q_2, q_1) \cdots P(q_{n-1} \mid q_{n-2}, \ldots, q_1)\, P(q_n \mid q_{n-1}, \ldots, q_1)$
$= P(q_1)\, P(q_2 \mid q_1)\, P(q_3 \mid q_2) \cdots P(q_{n-1} \mid q_{n-2})\, P(q_n \mid q_{n-1}) = \prod_{i=1}^{n} P(q_i \mid q_{i-1})$,
where we assume $P(q_1 \mid q_0) = P(q_1)$, i.e., $P(q_0) = 1$.
Markov (Markov Models and Markov Processes)
Example (Weather Forecasting)
[Weather State Table] (row: yesterday's weather $q_{n-1}$; column: today's weather $q_n$)
        S      R      F
  S    0.8    0.05   0.15
  R    0.2    0.6    0.2
  F    0.2    0.3    0.5
[Transition Matrix] $P = \begin{pmatrix} 0.8 & 0.05 & 0.15 \\ 0.2 & 0.6 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$
[Transition Diagram: states S, R, F with self-loops 0.8, 0.6, 0.5 and the cross transitions from the table]
Markov (Markov Models and Markov Processes)
Example (Weather Forecasting)
Case Study: Suppose that yesterday's weather ($q_1$) was Sunny (S). Find the probability that today's weather ($q_2$) is Sunny (S) and tomorrow's weather ($q_3$) is Rainy (R).
(Solution)
$P(q_2 = S, q_3 = R \mid q_1 = S) = P(q_3 = R \mid q_2 = S, q_1 = S)\, P(q_2 = S \mid q_1 = S)$
$= P(q_3 = R \mid q_2 = S)\, P(q_2 = S \mid q_1 = S)$ [Markov Assumption]
$= 0.05 \times 0.8 = 0.04$
Equivalently,
$P(q_1 = S, q_2 = S, q_3 = R) = P(q_1 = S)\, P(q_2 = S \mid q_1 = S)\, P(q_3 = R \mid q_2 = S, q_1 = S)$
$= P(q_1 = S)\, P(q_2 = S \mid q_1 = S)\, P(q_3 = R \mid q_2 = S)$ [Markov Assumption]
$= 1.0 \times 0.8 \times 0.05 = 0.04$
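The same chain-rule computation can be scripted; a minimal sketch, assuming the weather transition matrix above, that multiplies one-step probabilities along a given state sequence:

```python
import numpy as np

states = ['S', 'R', 'F']
P = np.array([[0.8, 0.05, 0.15],
              [0.2, 0.6,  0.2 ],
              [0.2, 0.3,  0.5 ]])

def sequence_prob(seq):
    """P(q_2, ..., q_n | q_1) for a state sequence under the first-order chain."""
    idx = [states.index(s) for s in seq]
    prob = 1.0
    for i, j in zip(idx[:-1], idx[1:]):
        prob *= P[i, j]
    return prob

print(sequence_prob(['S', 'S', 'R']))  # 0.8 * 0.05 = 0.04
```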
Outline
- Hidden Markov Models
  - Markov
  - Hidden Markov Model (HMM)
    - Example: Weather
    - Example: Balls in Jars
  - HMM Applications: Probability Evaluation
HMM (Example: Weather)
[Example (Weather)] You are in a house with no windows, and a friend visits you once a day. You can estimate the weather by checking whether your friend carries an umbrella. Your friend carries an umbrella with probability 0.1, 0.8, and 0.3 when the weather is S, R, and F, respectively.
Observation: with umbrella ($o_i = UO$) or without umbrella ($o_i = UX$).
The weather can now be estimated by observing $o_i$, $i \ge 1$. Therefore, according to Bayes' theorem:
$P(q_i \mid o_i) = \dfrac{P(o_i \mid q_i)\, P(q_i)}{P(o_i)}$
HMM (Example: Weather)
[Example (Weather), continued] When the sequences of weather and umbrella observations are given, i.e., $q_1, \ldots, q_n$ and $o_1, \ldots, o_n$, the conditional probability is:
$P(q_1, \ldots, q_n \mid o_1, \ldots, o_n) = \dfrac{P(o_1, \ldots, o_n \mid q_1, \ldots, q_n)\, P(q_1, \ldots, q_n)}{P(o_1, \ldots, o_n)}$
HMM (Example: Balls in Jars)
[Example (Balls in Jars)] A room is divided by a curtain, behind which there are three jars containing balls (colors: red, blue, green, and purple). A person behind the curtain selects a jar and picks one ball from it. The person shows the ball, puts it back into the jar, and repeats.
Notation:
- $b_j(k)$: probability of picking a ball of color $k$ from jar $j$, where $k = 1, 2, 3, 4$ for red, blue, green, and purple, respectively.
- $N$: the number of states (i.e., the number of jars): $S = \{S_1, \ldots, S_N\}$
- $M$: the number of observation symbols (i.e., the number of colors): $O = \{O_1, \ldots, O_M\}$
- State transition matrix $A = \{a_{ij}\}$, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$ is the probability of a transition from state $i$ to state $j$.
- Observation probabilities $B = \{b_j(k)\}$, where $b_j(k) = P(o_t = O_k \mid q_t = S_j)$ is the probability of observing $k$ in state $j$.
- Initial state distribution $\pi = \{\pi_i\}$, where $\pi_i = P(q_1 = S_i)$.
Outline
- Hidden Markov Models
  - Markov
  - Hidden Markov Model (HMM)
  - HMM Applications: Probability Evaluation
HMM Applications: Probability Evaluation
[Problem Definition (Probability Evaluation)] Given an observation sequence $O = o_1, o_2, o_3, \ldots$ and an HMM $\lambda = (A, B, \pi)$, determine how likely the model is to produce the observation sequence; that is, compute $P(O \mid \lambda)$.
[Example] We toss a coin modeled by an HMM $\lambda = (A, B, \pi)$, and we want to find the probability of the observation sequence $O = (T, H, T)$.
HMM Applications: Probability Evaluation
[Example, continued] The given HMM $\lambda = (A, B, \pi)$ is:
$A = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 0 \\ 1/2 & 1/2 \\ 1/3 & 2/3 \end{pmatrix}, \quad \pi = (1/3 \;\; 1/3 \;\; 1/3)$
(rows of $B$: states 1–3; columns of $B$: $P[H]$, $P[T]$)
HMM Applications: Probability Evaluation
[Transition Diagram: state 1 ($P[H] = 1$, $P[T] = 0$) with self-loop $1/3$ and transitions of $1/3$ each to states 2 and 3; state 2 ($P[H] = 1/2$, $P[T] = 1/2$) with self-loop $1/2$ and transition $1/2$ to state 3; state 3 ($P[H] = 1/3$, $P[T] = 2/3$) with self-loop $1$]
HMM Applications: Probability Evaluation
[Trellis: states 1–3, with the emission probabilities above, unrolled over time steps $t = 0, 1, 2$]
HMM Applications: Probability Evaluation
[Probability Evaluation] Only four state sequences have nonzero probability: state 1 cannot emit T (the first and third observations), and transitions back to lower-numbered states have probability 0.
[Case 1] State 2 → State 2 → State 2:
$P_1(T, H, T) = \pi_2\, b_2(o_1 = T)\, a_{22}\, b_2(o_2 = H)\, a_{22}\, b_2(o_3 = T) = \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \approx 0.0104$
[Case 2] State 2 → State 2 → State 3:
$P_2(T, H, T) = \pi_2\, b_2(o_1 = T)\, a_{22}\, b_2(o_2 = H)\, a_{23}\, b_3(o_3 = T) = \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{2}{3} \approx 0.0139$
[Case 3] State 2 → State 3 → State 3:
$P_3(T, H, T) = \pi_2\, b_2(o_1 = T)\, a_{23}\, b_3(o_2 = H)\, a_{33}\, b_3(o_3 = T) = \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{3} \cdot 1 \cdot \frac{2}{3} \approx 0.0185$
[Case 4] State 3 → State 3 → State 3:
$P_4(T, H, T) = \pi_3\, b_3(o_1 = T)\, a_{33}\, b_3(o_2 = H)\, a_{33}\, b_3(o_3 = T) = \frac{1}{3} \cdot \frac{2}{3} \cdot 1 \cdot \frac{1}{3} \cdot 1 \cdot \frac{2}{3} \approx 0.0494$
$P(O) = \sum_{i=1}^{4} P_i(T, H, T) \approx 0.0922$
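The enumeration above can be verified by brute force over all $3^3$ state paths (paths not listed contribute zero); a short sketch using the model matrices from the previous slides:

```python
import itertools
import numpy as np

A  = np.array([[1/3, 1/3, 1/3],
               [0,   1/2, 1/2],
               [0,   0,   1  ]])
B  = np.array([[1,   0  ],     # columns: P[H], P[T]
               [1/2, 1/2],
               [1/3, 2/3]])
pi = np.array([1/3, 1/3, 1/3])

obs = [1, 0, 1]  # T, H, T (0 = H, 1 = T)

# Sum the joint probability of (path, observations) over every state path
total = 0.0
for path in itertools.product(range(3), repeat=3):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, 3):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    total += p
print(total)  # ~0.0922
```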
HMM Applications: Probability Evaluation
Forward Algorithm for Probability Evaluation
Step 1) Initialization: $\alpha_1(i) = \pi_i\, b_i(o_1)$, $1 \le i \le 3$
At $t = 0$:
$\alpha_1(1) = \pi_1\, b_1(o_1 = T) = \frac{1}{3} \cdot 0 = 0$
$\alpha_1(2) = \pi_2\, b_2(o_1 = T) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$
$\alpha_1(3) = \pi_3\, b_3(o_1 = T) = \frac{1}{3} \cdot \frac{2}{3} = \frac{2}{9}$
HMM Applications: Probability Evaluation
Forward Algorithm for Probability Evaluation
Step 2) Derivation: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{3} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1})$, $1 \le t \le 2$, $1 \le j \le 3$
At $t = 1$:
$\alpha_2(1) = \left[\sum_{i=1}^{3} \alpha_1(i)\, a_{i1}\right] b_1(o_2 = H) = 0$
$\alpha_2(2) = \left[\sum_{i=1}^{3} \alpha_1(i)\, a_{i2}\right] b_2(o_2 = H) = \left(\frac{1}{6} \cdot \frac{1}{2}\right) \cdot \frac{1}{2} = \frac{1}{24} \approx 0.0417$
$\alpha_2(3) = \left[\sum_{i=1}^{3} \alpha_1(i)\, a_{i3}\right] b_3(o_2 = H) = \left(\frac{1}{6} \cdot \frac{1}{2} + \frac{2}{9} \cdot 1\right) \cdot \frac{1}{3} \approx 0.1019$
HMM Applications: Probability Evaluation
Forward Algorithm for Probability Evaluation
Step 2) Derivation: $\alpha_{t+1}(j) = \left[\sum_{i=1}^{3} \alpha_t(i)\, a_{ij}\right] b_j(o_{t+1})$, $1 \le t \le 2$, $1 \le j \le 3$
At $t = 2$:
$\alpha_3(1) = \left[\sum_{i=1}^{3} \alpha_2(i)\, a_{i1}\right] b_1(o_3 = T) = 0$
$\alpha_3(2) = \left[\sum_{i=1}^{3} \alpha_2(i)\, a_{i2}\right] b_2(o_3 = T) = \left(0.0417 \cdot \frac{1}{2}\right) \cdot \frac{1}{2} \approx 0.0104$
$\alpha_3(3) = \left[\sum_{i=1}^{3} \alpha_2(i)\, a_{i3}\right] b_3(o_3 = T) = \left(0.0417 \cdot \frac{1}{2} + 0.1019 \cdot 1\right) \cdot \frac{2}{3} \approx 0.0818$
HMM Applications: Probability Evaluation
Forward Algorithm for Probability Evaluation
Step 3) Termination: $P(O \mid \lambda) = \sum_{i=1}^{3} \alpha_3(i) = 0 + 0.0104 + 0.0818 \approx 0.0922$
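The three steps condense into a short vectorized implementation; a minimal sketch, assuming the coin-toss model matrices from the earlier slides:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(O | lambda) for an observation index sequence."""
    alpha = pi * B[:, obs[0]]          # Step 1: initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # Step 2: [sum_i alpha_t(i) a_ij] * b_j(o)
    return alpha.sum()                 # Step 3: termination

A  = np.array([[1/3, 1/3, 1/3],
               [0,   1/2, 1/2],
               [0,   0,   1  ]])
B  = np.array([[1,   0  ],             # columns: P[H], P[T]
               [1/2, 1/2],
               [1/3, 2/3]])
pi = np.array([1/3, 1/3, 1/3])

print(forward(A, B, pi, [1, 0, 1]))    # T, H, T -> ~0.0922
```

Unlike the brute-force enumeration, which costs $O(N^T)$, the forward algorithm costs only $O(N^2 T)$.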
Part 2: Markov Decision Process
Outline
- Markov Decision Process (MDP)
  - Basics
  - Markov Property
  - Policy and Return
  - Value Functions (V, Q)
- Solving MDP
  - Planning
  - Reinforcement Learning (Value-based)
  - Reinforcement Learning (Policy-based): advanced topic (out of scope)
MDP (Basics)
Markov Decision Process (MDP) Components: <S, A, R, T, γ>
How can we use an MDP to model an agent in a maze?
- S: set of states — the agent's location $(x, y)$ if the maze is a 2D grid ($s_0$: starting state; $s$: current state; $s'$: next state; $s_t$: state at time $t$)
- A: set of actions — move up, down, left, or right
- R: reward function — how good was the chosen action? $r = R(s, a, s')$; e.g., −1 for moving (battery used), +1 for a jewel, +100 for the exit
- T: transition function — where is the robot's new location? $s' \sim T(s' \mid s, a)$ (stochastic transition)
- γ: discount factor — how much is future reward worth? $0 \le \gamma \le 1$; as $\gamma \to 0$, future reward counts for almost nothing, so immediate reward is preferred
A sketch of these components in code follows below.
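As a sketch of how these components might be represented in code (the container and the toy corridor are our own illustrations, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable

# A minimal container matching the <S, A, R, T, gamma> tuple.
@dataclass
class MDP:
    states: list          # S
    actions: list         # A
    reward: Callable      # R(s, a, s') -> float
    transition: Callable  # T(s, a) -> {s': probability}
    gamma: float          # discount factor

# Toy 1-D corridor: states 0..3, state 3 is the exit (+100), each move costs 1.
corridor = MDP(
    states=[0, 1, 2, 3],
    actions=['left', 'right'],
    reward=lambda s, a, s2: 100 if s2 == 3 else -1,
    transition=lambda s, a: {max(0, min(3, s + (1 if a == 'right' else -1))): 1.0},
    gamma=0.9,
)
```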
MDP (Markov Property)
Does $s_{t+1}$ depend on the whole history $s_0, s_1, \ldots, s_{t-1}, s_t$? No. Memoryless!
- The future depends only on the present.
- The current state is a sufficient statistic of the agent's history.
- No need to remember the agent's history.
- $s_{t+1}$ depends only on $s_t$ and $a_t$.
- $r_t$ depends only on $s_t$ and $a_t$.
MDP (Policy and Return)
Policy $\pi: S \to A$
- Maps states to actions; gives an action for every state.
Return
- Discounted sum of rewards: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$
- Could also be undiscounted over a finite horizon.
Our goal: find the $\pi$ that maximizes expected return!
MDP (Value Functions (V, Q))
State Value Function (V)
- $V^{\pi}(s) = E_{\pi}[R_t \mid s_t = s] = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$
- Expected return when starting at state $s$ and following policy $\pi$: how much return do I expect starting from state $s$?
Action Value Function (Q)
- $Q^{\pi}(s, a) = E_{\pi}[R_t \mid s_t = s, a_t = a] = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$
- Expected return when starting at state $s$, taking action $a$, and then following policy $\pi$: how much return do I expect starting from state $s$ and taking action $a$?
MDP (Solving MDP: Planning)
Again, our goal is to find the optimal policy $\pi^*(s) = \arg\max_{\pi} R^{\pi}(s)$, i.e., the policy with the largest expected return from each state.
If $T(s' \mid s, a)$ and $R(s, a, s')$ are known, this is a planning problem, and we can use dynamic programming to find the optimal policy.
Keywords: Bellman equation, value iteration, policy iteration
MDP (Solving MDP: Planning)
Bellman Equation
$\forall s \in S: \; V(s) = \max_{a} \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V(s')\right]$
Value Iteration
$\forall s \in S: \; V_{i+1}(s) \leftarrow \max_{a} \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V_i(s')\right]$
Policy Iteration
- Policy Evaluation: $\forall s \in S: \; V^{\pi_k}_{i+1}(s) \leftarrow \sum_{s'} T(s, \pi_k(s), s')\left[R(s, \pi_k(s), s') + \gamma V^{\pi_k}_i(s')\right]$
- Policy Improvement: $\pi_{k+1}(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V^{\pi_k}(s')\right]$
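A minimal value-iteration sketch on a toy deterministic corridor (our own illustrative setup, not from the slides): it repeats the Bellman backup until convergence and then extracts the greedy policy:

```python
import numpy as np

# States 0..3; entering state 3 (the exit) gives +100, every move costs 1.
N_STATES, GOAL, GAMMA = 4, 3, 0.9
ACTIONS = [-1, +1]  # left, right

def step(s, a):
    """Deterministic transition: returns (next state, reward)."""
    s2 = max(0, min(N_STATES - 1, s + a))
    return s2, (100 if s2 == GOAL else -1)

V = np.zeros(N_STATES)
while True:
    V_new = np.zeros(N_STATES)
    for s in range(N_STATES - 1):  # the goal state keeps value 0
        V_new[s] = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Policy improvement step: act greedily with respect to the converged V
policy = [max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in range(N_STATES - 1)]
print(V, policy)  # values grow toward the exit; policy is [+1, +1, +1] (always right)
```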
MDP (Solving MDP: Reinforcement Learning (Value-based))
If $T(s' \mid s, a)$ and $R(s, a, s')$ are unknown, this is a reinforcement learning problem. The agent needs to interact with the world and gather experience. At each time step, from state $s$:
- Take action $a$ ($a = \pi(s)$ if the policy is deterministic)
- Receive reward $r$
- End in state $s'$
Value-based: learn an optimal value function from these data.
MDP (Solving MDP: Reinforcement Learning (Value-based))
One way to learn $Q(s, a)$: use the empirical mean return instead of the expected return, i.e., average the sampled returns:
$Q(s, a) = \dfrac{R_1(s, a) + R_2(s, a) + \cdots + R_n(s, a)}{n}$
The policy then chooses the action that maximizes $Q(s, a)$: $\pi(s) = \arg\max_{a} Q(s, a)$.
Using $V(s)$ instead would require the model: $\pi(s) = \arg\max_{a} \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V(s')\right]$
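A small sketch of the sample-average estimate, written as an incremental running mean so each newly observed return updates $Q$ (state, action, and return values here are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)       # Q(s, a) estimates, default 0
counts = defaultdict(int)    # number of returns seen per (s, a)

def update_q(s, a, g):
    """Incremental mean: Q <- Q + (G - Q) / n, equivalent to averaging R_1..R_n."""
    counts[(s, a)] += 1
    Q[(s, a)] += (g - Q[(s, a)]) / counts[(s, a)]

def greedy_action(s, actions):
    """pi(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])

for g in [4.0, 6.0, 5.0]:    # three sampled returns for the same (s, a)
    update_q('s0', 'right', g)
print(Q[('s0', 'right')])    # 5.0, the empirical mean
```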
Part 3: Support Vector Machine
Outline
- Main Idea
- Hyperplane in n-dimensional Space
- Brief Introduction to Optimization for Support Vector Machine (SVM)
- SVM for Classification
Main Idea
How can we classify the given data? Any of these separating lines would be fine. But which is the best?
Main Idea
Find a linear decision surface (hyperplane) that can separate the patient classes and has the largest distance (i.e., largest gap, or margin) between the border-line patients (i.e., the support vectors).
[Figure: normal patients and cancer patients plotted on Gene X vs. Gene Y axes, separated by a hyperplane with the gap marked]
Main Idea
Kernel: if a linear decision surface does not exist, the data is mapped into a higher-dimensional space (feature space) where a separating decision surface can be found. The feature space is constructed via a mathematical projection (the kernel trick).
Outline
- Main Idea
- Hyperplane in n-dimensional Space
- Brief Introduction to Optimization for Support Vector Machine (SVM)
- SVM for Classification
Hyperplane in n-dimensional Space
[Definition (Hyperplane)] A subspace of one dimension less than its ambient space; i.e., a hyperplane in $n$-dimensional space is an $(n-1)$-dimensional subspace.
Hyperplane in n-dimensional Space
Equations of a Hyperplane
An equation of a hyperplane is defined by a point ($P_0$) and a vector perpendicular to the plane ($w$) at that point.
Define the vectors $x_0$ (to $P_0$) and $x$ (to $P$), where $P$ is an arbitrary point on the hyperplane. The condition for $P$ to be on the plane is that the vector $x - x_0$ is perpendicular to $w$:
$w \cdot (x - x_0) = 0$, i.e., $w \cdot x - w \cdot x_0 = 0$; defining $b = -w \cdot x_0$ gives
$w \cdot x + b = 0$
These equations hold in $\mathbb{R}^n$ for any $n$, including $n > 3$.
Hyperplane in n-dimensional Space
Equations of a Hyperplane
Take $x_1$ on the hyperplane $w \cdot x + b_1 = 0$ and move along $w$ until hitting $w \cdot x + b_2 = 0$: $x_2 = x_1 + t w$, so the distance is $D = \|t w\| = |t| \, \|w\|$. Since $w \cdot x_2 + b_2 = 0$:
$w \cdot (x_1 + t w) + b_2 = 0$
$w \cdot x_1 + t \|w\|^2 + b_2 = 0$
$(w \cdot x_1 + b_1) - b_1 + t \|w\|^2 + b_2 = 0$
$-b_1 + t \|w\|^2 + b_2 = 0 \;\Rightarrow\; t = (b_1 - b_2) / \|w\|^2$
Therefore, $D = |t| \, \|w\| = |b_1 - b_2| / \|w\|$: the distance between two parallel hyperplanes $w \cdot x + b_1 = 0$ and $w \cdot x + b_2 = 0$ is $D = \dfrac{|b_1 - b_2|}{\|w\|}$.
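A quick numeric check of $D = |b_1 - b_2| / \|w\|$ (the values below are our own example):

```python
import numpy as np

w, b1, b2 = np.array([3.0, 4.0]), 1.0, -9.0

D = abs(b1 - b2) / np.linalg.norm(w)
print(D)                          # 10 / 5 = 2.0

# Cross-check: start on the first plane and step along w by t = (b1 - b2)/||w||^2
x1 = np.array([0.0, -b1 / w[1]])  # satisfies w . x1 + b1 = 0
t = (b1 - b2) / np.linalg.norm(w) ** 2
x2 = x1 + t * w
print(np.dot(w, x2) + b2)         # ~0: x2 lies on the second plane
print(np.linalg.norm(x2 - x1))    # 2.0, matching D
```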
Outline
- Main Idea
- Hyperplane in n-dimensional Space
- Brief Introduction to Optimization for Support Vector Machine (SVM)
- SVM for Classification
Brief Introduction to Optimization for Support Vector Machine
Now we understand:
- How to represent data (vectors)
- How to define a linear decision surface (hyperplane)
We still need to understand:
- How to efficiently compute the hyperplane that separates two classes with the largest gap
- This requires the basics of the relevant optimization theory.
Brief Introduction to Optimization for Support Vector Machine
Convex Functions
A function is called convex if, for any two points in the interval, the function lies below the straight line segment connecting the two points.
Property: any local minimum is a global minimum.
Brief Introduction to Optimization for Support Vector Machine
Quadratic Programming (QP)
Quadratic programming (QP) is a special class of optimization problem: the function to optimize (the objective) is quadratic, subject to linear constraints.
Convex QP problems have convex objective functions. These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).
Brief Introduction to Optimization for Support Vector Machine
Quadratic Programming (QP)
[Example] Consider $x = (x_1, x_2)$:
Minimize $\frac{1}{2}\|x\|_2^2$ subject to $x_1 + x_2 - 1 \ge 0$ (quadratic objective, linear constraints)
Equivalently: minimize $\frac{1}{2}(x_1^2 + x_2^2)$ subject to $x_1 + x_2 - 1 \ge 0$ (quadratic objective, linear constraints)
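This toy QP can be handed directly to an off-the-shelf convex solver; a sketch using the cvxpy library (an external dependency, not mentioned on the slides):

```python
import cvxpy as cp

# minimize (1/2)||x||^2 subject to x1 + x2 - 1 >= 0
x = cp.Variable(2)
objective = cp.Minimize(0.5 * cp.sum_squares(x))
constraints = [cp.sum(x) - 1 >= 0]
prob = cp.Problem(objective, constraints)
prob.solve()

print(x.value)     # ~[0.5, 0.5]: the point on x1 + x2 = 1 closest to the origin
print(prob.value)  # ~0.25
```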
Outline
- Main Idea
- Hyperplane in n-dimensional Space
- Brief Introduction to Optimization for Support Vector Machine (SVM)
- SVM for Classification
SVM for Classification
- (Case 1) Linearly separable data: hard-margin linear SVM
- (Case 2) Not linearly separable data: soft-margin linear SVM
- (Case 3) Not linearly separable data: kernel trick
SVM for Classification
(Case 1) Linearly Separable Data: Hard-Margin Linear SVM
We want to find a classifier (hyperplane) that separates the negative instances from the positive ones. An infinite number of such hyperplanes exist. The SVM finds the hyperplane that maximizes the gap between the data points on the boundaries (the so-called support vectors). If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well.
SVM for Classification
(Case 1) Linearly Separable Data: Hard-Margin Linear SVM
The gap is the distance between the two parallel hyperplanes $w \cdot x + b = -1$ and $w \cdot x + b = +1$. From the earlier result $D = |b_1 - b_2| / \|w\|$, we get $D = 2 / \|w\|$. Since we want to maximize the gap, we minimize $\|w\|$, or equivalently minimize $\frac{1}{2}\|w\|^2$.
SVM for Classification
(Case 1) Linearly Separable Data: Hard-Margin Linear SVM
In addition, we need to impose constraints that all instances are correctly classified. In our case:
$w \cdot x_i + b \le -1$ if $y_i = -1$, and $w \cdot x_i + b \ge +1$ if $y_i = +1$; equivalently, $y_i (w \cdot x_i + b) \ge 1$.
In summary:
Minimize $\frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$, for $i = 1, \ldots, N$
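This is exactly a convex QP of the kind introduced above, so it can be solved the same way; a hard-margin sketch with cvxpy on a tiny separable toy set (data and names are our own illustration):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
# minimize (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)             # the maximum-margin hyperplane
print(2 / np.linalg.norm(w.value))  # the gap 2/||w||
```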
SVM for Classification
(Case 2) Not Linearly Separable Data: Soft-Margin Linear SVM
What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear.
Approach: assign a slack variable $\xi_i \ge 0$ to each instance, which can be thought of as the distance from the separating hyperplane if the instance is misclassified, and 0 otherwise.
Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, for $i = 1, \ldots, N$
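In practice, soft-margin linear SVMs are usually fit with a library; a sketch with scikit-learn (an assumed dependency), where the parameter C is the slack-penalty weight from the objective above — small C tolerates more margin violations, large C approaches the hard margin:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs: not perfectly separable, so slack is needed
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)  # the learned w and b
print(clf.score(X, y))            # training accuracy (< 1.0 where blobs overlap)
```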
SVM for Classification
(Case 3) Not Linearly Separable Data: Kernel Trick
Data that is not linearly separable in the input space can be linearly separable in the feature space obtained by a kernel.
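A short scikit-learn sketch of this idea on concentric-circle data (our own illustration): a linear SVM fails in the input space, while an RBF kernel separates the classes in the induced feature space:

```python
import numpy as np
from sklearn.svm import SVC

# Two concentric rings: no straight line separates them in the input space
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.array([1.0] * 100 + [3.0] * 100) + rng.normal(0, 0.1, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([-1] * 100 + [1] * 100)

print(SVC(kernel='linear').fit(X, y).score(X, y))  # ~0.5: no separating line
print(SVC(kernel='rbf').fit(X, y).score(X, y))     # ~1.0: separable via the kernel
```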
Questions?