Hidden Markov Models (HMM) and Support Vector Machine (SVM)

Hidden Markov Models (HMM) and Support Vector Machine (SVM). Professor Joongheon Kim, School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea.

Hidden Markov Models (HMM) and Support Vector Machine (SVM), Part 1: Hidden Markov Models. Professor Joongheon Kim, School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea.

Outline: Hidden Markov Models: Markov (Markov Chain; Markov Models and Markov Processes); Hidden Markov Model (HMM); HMM Applications: Probability Evaluation.

Markov (Markov Chain) [Definition ($P_{ij}$)] The fixed probability (one-step transition probability) that the process will next be in state $j$ whenever it is in state $i$. That is, $P_{ij} = P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1, X_0 = i_0)$ for all states $i_0, i_1, \ldots, i_{n-1}, i, j$ and all $n \ge 0$. [Note (Markov Property)] For all states $i_0, i_1, \ldots, i_{n-1}, i, j$ and all $n \ge 0$, $P_{ij} = P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_1 = i_1, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$.

Markov (Markov Chain) [Note] $P_{ij} \ge 0$ for all $i \ge 0$, $j \ge 0$, and $\sum_{j=0}^{\infty} P_{ij} = 1$ for all $i = 0, 1, \ldots$ [Markov Chain] (Transition diagram: from state $i$, edges labeled $P_{i1}, P_{i2}, \ldots, P_{ii}, \ldots, P_{in}$ lead to states $1, 2, \ldots, i, \ldots, n$.)

Markov (Markov Chain) [Note ($P$)] Let $P$ denote the matrix of one-step transition probabilities, i.e., $P = \begin{pmatrix} P_{ii} & P_{ij} & P_{ik} \\ P_{ji} & P_{jj} & P_{jk} \\ P_{ki} & P_{kj} & P_{kk} \end{pmatrix}$. (Transition diagram: states $i$, $j$, $k$ with the corresponding edge labels.)

Markov (Markov Chain) [Example] There are two milk companies in South Korea, A and B. Based on last year's statistics, 88% of A's customers are still with A, while the other 12% are now with B. Likewise, 85% of B's customers are still with B, while the other 15% are now with A. [Transition Matrix] $P = \begin{pmatrix} P_{AA} & P_{AB} \\ P_{BA} & P_{BB} \end{pmatrix} = \begin{pmatrix} 0.88 & 0.12 \\ 0.15 & 0.85 \end{pmatrix}$ [Markov Chain] $P_{AA} = 0.88$, $P_{AB} = 0.12$, $P_{BA} = 0.15$, $P_{BB} = 0.85$. [One-Step Transition] If the initial market share is A = 0.25 and B = 0.75, i.e., $s_0 = (0.25 \;\; 0.75)$, the next market share is $s_1 = s_0 P = (0.25 \;\; 0.75) \begin{pmatrix} 0.88 & 0.12 \\ 0.15 & 0.85 \end{pmatrix} = (0.3325 \;\; 0.6675)$.

Markov (Markov Chain) [Example (Multi-Step Transition)] From $P$ (in the previous slide), suppose that we are in state $i$ at time $t$ and we want the probability of being in state $i$ at time $t+2$ (denoted $P^2_{ii}$). $P^2_{ii} = P(X_{n+2} = i \mid X_n = i) = P_{ii}P_{ii} + P_{ij}P_{ji} + P_{ik}P_{ki}$, i.e., the $(i,i)$ entry of $\begin{pmatrix} P_{ii} & P_{ij} & P_{ik} \\ P_{ji} & P_{jj} & P_{jk} \\ P_{ki} & P_{kj} & P_{kk} \end{pmatrix}^2 = P^2$. [$X_n = i$: state is $i$ at time $n$]
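To make the one-step and two-step transitions concrete, here is a minimal NumPy sketch (the variable names are illustrative, not from the slides) that reproduces the market-share update $s_1 = s_0 P$ and the two-step matrix $P^2$ from the milk-company example:

```python
import numpy as np

# One-step transition matrix from the milk-company example (rows/columns: A, B).
P = np.array([[0.88, 0.12],
              [0.15, 0.85]])

# Initial market share s0 and one-step update s1 = s0 P.
s0 = np.array([0.25, 0.75])
s1 = s0 @ P
print(s1)        # [0.3325 0.6675]

# Two-step transition probabilities: P^2, whose (i, i) entry is P^2_ii.
P2 = P @ P
print(P2[0, 0])  # probability of being back with company A after two steps
```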

Markov (Markov Models and Markov Processes) Example of a Markov model (weather forecasting). Weather states: Sunny (S), Rainy (R), Foggy (F). Today's weather $q_n$ depends on the previous weather conditions $q_{n-1}, q_{n-2}, \ldots, q_1$: $P(q_n \mid q_{n-1}, q_{n-2}, \ldots, q_1)$. Example: if the previous three weather conditions were $q_{n-1} = S$, $q_{n-2} = R$, and $q_{n-3} = F$, then the probability that today's weather ($q_n$) is R is $P(q_n = R \mid q_{n-1} = S, q_{n-2} = R, q_{n-3} = F)$.

Markov (Markov Models and Markov Processes) Observation from the previous [Example]: the larger $n$ is, the more information we have to gather. If $n = 6$, we need to gather $3^{(6-1)} = 243$ weather statistics. Therefore, we need an assumption (called the Markov assumption) that reduces the amount of data to gather. [First-Order Markov Assumption] $P(q_n = S_j \mid q_{n-1} = S_i, q_{n-2} = S_k, \ldots) = P(q_n = S_j \mid q_{n-1} = S_i)$. [Second-Order Markov Assumption] $P(q_n = S_j \mid q_{n-1} = S_i, q_{n-2} = S_k, \ldots) = P(q_n = S_j \mid q_{n-1} = S_i, q_{n-2} = S_k)$. Under the first-order assumption, the joint probability factorizes as $P(q_1, q_2, \ldots, q_n) = \prod_{i=1}^{n} P(q_i \mid q_{i-1})$.

Markov (Markov Models and Markov Processes) Observation from the previous [Example] (Continued): with the Markov assumption, the probability of observing a sequence $q_1, q_2, \ldots, q_n$ can be written as a joint probability: $P(q_1, q_2, \ldots, q_n) = P(q_1) P(q_2 \mid q_1) P(q_3 \mid q_2, q_1) \cdots P(q_{n-1} \mid q_{n-2}, \ldots, q_1) P(q_n \mid q_{n-1}, \ldots, q_1) = P(q_1) P(q_2 \mid q_1) P(q_3 \mid q_2) \cdots P(q_{n-1} \mid q_{n-2}) P(q_n \mid q_{n-1}) = \prod_{i=1}^{n} P(q_i \mid q_{i-1})$, when we adopt the convention $P(q_1 \mid q_0) = P(q_1)$ (i.e., $P(q_0) = 1$).

Markov (Markov Models and Markov Processes) Example (Weather Forecasting) [Weather State Table] (rows: yesterday's weather $q_{n-1}$; columns: today's weather $q_n$)
q_{n-1} \ q_n    S      R      F
S                0.8    0.05   0.15
R                0.2    0.6    0.2
F                0.2    0.3    0.5
[Transition Matrix] $P = \begin{pmatrix} 0.8 & 0.05 & 0.15 \\ 0.2 & 0.6 & 0.2 \\ 0.2 & 0.3 & 0.5 \end{pmatrix}$ [Transition Diagram] (States S, R, F with self-loop probabilities 0.8, 0.6, 0.5 and the cross-transition probabilities from the table above.)

Markov (Markov Models and Markov Processes) Example (Weather Forecasting) Case Study: Suppose that yesterday's ($q_1$) weather was Sunny (S). Find the probability that today's ($q_2$) weather is Sunny (S) and tomorrow's ($q_3$) weather is Rainy (R). (Solution) $P(q_2 = S, q_3 = R \mid q_1 = S) = P(q_3 = R \mid q_2 = S, q_1 = S) P(q_2 = S \mid q_1 = S) = P(q_3 = R \mid q_2 = S) P(q_2 = S \mid q_1 = S)$ [Markov assumption] $= 0.05 \times 0.8 = 0.04$. Equivalently, $P(q_1 = S, q_2 = S, q_3 = R) = P(q_1 = S) P(q_2 = S \mid q_1 = S) P(q_3 = R \mid q_2 = S, q_1 = S) = P(q_1 = S) P(q_2 = S \mid q_1 = S) P(q_3 = R \mid q_2 = S)$ [Markov assumption] $= 1.0 \times 0.8 \times 0.05 = 0.04$.
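As a quick check on this case study, a small sketch (index names assumed) that chains the corresponding entries of the weather transition matrix:

```python
import numpy as np

# Weather transition matrix, rows/columns ordered (S, R, F).
P = np.array([[0.8, 0.05, 0.15],
              [0.2, 0.6,  0.2 ],
              [0.2, 0.3,  0.5 ]])
S, R, F = 0, 1, 2

# P(q2 = S, q3 = R | q1 = S) under the first-order Markov assumption.
prob = P[S, S] * P[S, R]
print(prob)  # 0.04
```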

Outline: Hidden Markov Models: Markov; Hidden Markov Model (HMM) (Example: Weather; Example: Balls in Jars); HMM Applications: Probability Evaluation.

HMM (Example: Weather) [Example (Weather)] You are in a house that has no windows. Your friend visits you once a day, so you can estimate the weather by checking whether your friend carries an umbrella or not. Your friend carries an umbrella with probability 0.1, 0.8, and 0.3 when the weather is S, R, and F, respectively. Observation: with umbrella ($o_i = UO$) or without umbrella ($o_i = UX$). The weather can then be estimated by observing $o_i$, $i \ge 1$. Therefore, according to Bayes' theorem: $P(q_i \mid o_i) = \frac{P(o_i \mid q_i) P(q_i)}{P(o_i)}$.

HMM (Example: Weather) [Example (Weather), continued] When the sequences of weather states and umbrella observations are given, i.e., $q_1, \ldots, q_n$ and $o_1, \ldots, o_n$, the conditional probability is: $P(q_1, \ldots, q_n \mid o_1, \ldots, o_n) = \frac{P(o_1, \ldots, o_n \mid q_1, \ldots, q_n) P(q_1, \ldots, q_n)}{P(o_1, \ldots, o_n)}$.

HMM (Example: Balls in Jars) [Example (Balls in Jars)] A room has a curtain, and behind it there are three jars containing balls (colors: red, blue, green, and purple). A person behind the curtain selects one jar and picks one ball from it. The person shows the ball, puts it back into the jar, and repeats. (Notation) $b_j(k)$: pick one ball from jar $j$ and the color of the ball is $k$, where $k = 1, 2, 3, 4$ when the color is red, blue, green, and purple, respectively. $N$: the number of states (i.e., the number of jars): $S = \{S_1, \ldots, S_N\}$. $M$: the number of observation symbols (i.e., the number of colors): $O = \{O_1, \ldots, O_M\}$. State transition matrix $A = \{a_{ij}\}$, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$ is the probability of a transition from state $i$ to state $j$. Observation probabilities $B = \{b_j(k)\}$, where $b_j(k) = P(O_t = o_k \mid q_t = S_j)$ is the probability that $k$ is observed in state $j$. Initial state distribution $\pi = \{\pi_i\}$, where $\pi_i = P(q_1 = S_i)$.

Outline: Hidden Markov Models: Markov; Hidden Markov Model (HMM); HMM Applications: Probability Evaluation.

HMM Applications: Probability Evaluation [Problem Definition (Probability Evaluation)] Given an observation sequence $O = (o_1, o_2, o_3, \ldots)$ and an HMM model $\lambda = (A, B, \pi)$, determine which model the observation sequence is most likely to have come from; this reduces to calculating $P(O \mid \lambda)$. [Example] We toss a coin with an HMM model $\lambda = (A, B, \pi)$, and we want to find the probability of the observation sequence $O = (T, H, T)$.

HMM Applications: Probability Evaluation [Example, continued] The given HMM model $\lambda = (A, B, \pi)$ is as follows: $A = \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \end{pmatrix}$, $B = \begin{pmatrix} 1 & 0 \\ 1/2 & 1/2 \\ 1/3 & 2/3 \end{pmatrix}$ (rows: states 1 to 3; columns: H, T), $\pi = (1/3 \;\; 1/3 \;\; 1/3)$.

HMM Applications: Probability Evaluation [Example, continued] [Transition Diagram] States 1, 2, 3 with emission probabilities P[H] = 1, P[T] = 0 (state 1); P[H] = 1/2, P[T] = 1/2 (state 2); P[H] = 1/3, P[T] = 2/3 (state 3). Transitions: state 1 goes to states 1, 2, and 3 with probability 1/3 each; state 2 goes to states 2 and 3 with probability 1/2 each; state 3 stays in state 3 with probability 1.

HMM Applications: Probability Evaluation [Example, continued] [Trellis] The trellis unrolls the three states (with their emission probabilities P[H] = 1, P[T] = 0; P[H] = 1/2, P[T] = 1/2; P[H] = 1/3, P[T] = 2/3) over time steps $t = 0, 1, 2$, one column per observation in $O = (T, H, T)$.

HMM Applications: Probability Evaluation [Probability Evaluation over the Trellis] [Case 1] State 2, State 2, State 2: $P_1(T,H,T) = \pi_2 b_2(o_1{=}T) \, a_{22} b_2(o_2{=}H) \, a_{22} b_2(o_3{=}T) = \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = 0.0104$. [Case 2] State 2, State 2, State 3: $P_2(T,H,T) = \pi_2 b_2(o_1{=}T) \, a_{22} b_2(o_2{=}H) \, a_{23} b_3(o_3{=}T) = \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{2}{3} = 0.0139$. [Case 3] State 2, State 3, State 3: $P_3(T,H,T) = \pi_2 b_2(o_1{=}T) \, a_{23} b_3(o_2{=}H) \, a_{33} b_3(o_3{=}T) = \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{3} \cdot 1 \cdot \frac{2}{3} = 0.0185$. [Case 4] State 3, State 3, State 3: $P_4(T,H,T) = \pi_3 b_3(o_1{=}T) \, a_{33} b_3(o_2{=}H) \, a_{33} b_3(o_3{=}T) = \frac{1}{3} \cdot \frac{2}{3} \cdot 1 \cdot \frac{1}{3} \cdot 1 \cdot \frac{2}{3} = 0.0494$. $P(O) = \sum_{i=1}^{4} P_i(T,H,T) = 0.0922$.
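The same value can be obtained by brute-force enumeration of every length-3 hidden-state path, multiplying initial, transition, and emission probabilities along each path and summing; a short sketch (array names assumed) follows:

```python
import itertools
import numpy as np

A  = np.array([[1/3, 1/3, 1/3],
               [0,   1/2, 1/2],
               [0,   0,   1  ]])   # state transition matrix
B  = np.array([[1,   0  ],
               [1/2, 1/2],
               [1/3, 2/3]])        # emission matrix, columns (H, T)
pi = np.array([1/3, 1/3, 1/3])     # initial state distribution

H, T = 0, 1
obs = [T, H, T]

# Sum the probability of every possible hidden-state path.
total = 0.0
for path in itertools.product(range(3), repeat=len(obs)):
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    total += p
print(total)  # ~0.0922
```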

HMM Applications: Probability Evaluation Forward Algorithm for Probability Evaluation. Step 1) Initialization ($\alpha_1(i) = \pi_i b_i(o_1)$, $1 \le i \le 3$): at $t = 0$, $\alpha_1(1) = \pi_1 b_1(o_1{=}T) = \frac{1}{3} \cdot 0 = 0$, $\alpha_1(2) = \pi_2 b_2(o_1{=}T) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$, $\alpha_1(3) = \pi_3 b_3(o_1{=}T) = \frac{1}{3} \cdot \frac{2}{3} = \frac{2}{9}$.

HMM Applications: Probability Evaluation Forward Algorithm for Probability Evaluation. Step 2) Derivation ($\alpha_{t+1}(j) = \left[\sum_{i=1}^{3} \alpha_t(i) a_{ij}\right] b_j(o_{t+1})$, $1 \le t \le 2$, $1 \le j \le 3$): at $t = 1$, $\alpha_2(1) = \left[\sum_{i=1}^{3} \alpha_1(i) a_{i1}\right] b_1(o_2{=}H) = 0$, $\alpha_2(2) = \left[\sum_{i=1}^{3} \alpha_1(i) a_{i2}\right] b_2(o_2{=}H) = \frac{1}{6} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{24} = 0.0417$, $\alpha_2(3) = \left[\sum_{i=1}^{3} \alpha_1(i) a_{i3}\right] b_3(o_2{=}H) = \left(\frac{1}{6} \cdot \frac{1}{2} + \frac{2}{9} \cdot 1\right) \cdot \frac{1}{3} = 0.1019$.

HMM Applications: Probability Evaluation Forward Algorithm for Probability Evaluation. Step 2) Derivation (continued): at $t = 2$, $\alpha_3(1) = \left[\sum_{i=1}^{3} \alpha_2(i) a_{i1}\right] b_1(o_3{=}T) = 0$, $\alpha_3(2) = \left[\sum_{i=1}^{3} \alpha_2(i) a_{i2}\right] b_2(o_3{=}T) = 0.0417 \cdot \frac{1}{2} \cdot \frac{1}{2} = 0.0104$, $\alpha_3(3) = \left[\sum_{i=1}^{3} \alpha_2(i) a_{i3}\right] b_3(o_3{=}T) = \left(0.0417 \cdot \frac{1}{2} + 0.1019 \cdot 1\right) \cdot \frac{2}{3} = 0.0818$.

HMM Applications: Probability Evaluation Forward Algorithm for Probability Evaluation. Step 3) Termination ($P(O \mid \lambda) = \sum_{i=1}^{3} \alpha_3(i)$): $P(O \mid \lambda) = 0 + 0.0104 + 0.0818 = 0.0922$.
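A compact forward-algorithm sketch (function and array names assumed) that reproduces the alpha values above and the final $P(O \mid \lambda)$ for this coin example:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns alpha (one row per step) and P(O | lambda)."""
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]                      # Step 1: initialization
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # Step 2: derivation
    return alpha, alpha[-1].sum()                     # Step 3: termination

A  = np.array([[1/3, 1/3, 1/3], [0, 1/2, 1/2], [0, 0, 1]])
B  = np.array([[1, 0], [1/2, 1/2], [1/3, 2/3]])        # columns: H, T
pi = np.array([1/3, 1/3, 1/3])
H, T = 0, 1

alpha, p = forward(A, B, pi, [T, H, T])
print(alpha)  # rows: (0, 1/6, 2/9), (0, 0.0417, 0.1019), (0, 0.0104, 0.0818)
print(p)      # ~0.0922
```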

Hidden Markov Models (HMM) and Support Vector Machine (SVM), Part 2: Markov Decision Process. Professor Joongheon Kim, School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea.

Outline: Markov Decision Process (MDP): Basics; Markov Property; Policy and Return; Value Functions (V, Q). Solving MDP: Planning; Reinforcement Learning (Value-based); Reinforcement Learning (Policy-based), an advanced topic (out of scope).

MDP (Basics) Markov Decision Process (MDP) Components: <S, A, R, T, γ>. S: set of states; A: set of actions; R: reward function; T: transition function; γ: discount factor. How can we use an MDP to model an agent in a maze?

MDP (Basics) Markov Decision Process (MDP) Components: <S, A, R, T, γ>. S: location (x, y) if the maze is a 2D grid; $s_0$: starting state; $s$: current state; $s'$: next state; $s_t$: state at time $t$.

MDP (Basics) Markov Decision Process (MDP) Components: <S, A, R, T, γ>. S: location (x, y) if the maze is a 2D grid. A: move up, down, left, or right (an action takes the agent from $s$ to $s'$).

MDP (Basics) Markov Decision Process (MDP) Components: <S, A, R, T, γ>. S: location (x, y) if the maze is a 2D grid. A: move up, down, left, or right. R: how good was the chosen action? $r = R(s, a, s')$, e.g., -1 for moving (battery used), +1 for a jewel? +100 for the exit?

MDP (Basics) Markov Decision Process (MDP) Components: <S, A, R, T, γ>. S: location (x, y) if the maze is a 2D grid. A: move up, down, left, or right. R: how good was the chosen action? T: where is the robot's new location? $T(s' \mid s, a)$; the transition may be stochastic.

MDP (Basics) Markov Decision Process (MDP) Components: <S, A, R, T, γ>. S: location (x, y) if the maze is a 2D grid. A: move up, down, left, or right. R: how good was the chosen action? T: where is the robot's new location? γ: how much is future reward worth? $0 \le \gamma \le 1$; γ near 0 means future reward is worth almost nothing (immediate reward is preferred).

MDP (Markov Property) Does $s_{t+1}$ depend on the whole history $s_0, s_1, \ldots, s_{t-1}, s_t$? No: the process is memoryless, and the future depends only on the present. The current state is a sufficient statistic of the agent's history, so there is no need to remember it. $s_{t+1}$ depends only on $s_t$ and $a_t$; $r_t$ depends only on $s_t$ and $a_t$.

MDP (Policy and Return) Policy $\pi: S \to A$: maps states to actions, giving an action for every state. Return: the discounted sum of rewards, $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. Our goal: find the $\pi$ that maximizes the expected return! (The return could also be undiscounted, e.g., over a finite horizon.)

MDP (Value Functions (V, Q)) State Value Function (V): $V^{\pi}(s) = E_{\pi}[R_t \mid s_t = s] = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$, the expected return of starting at state $s$ and following policy $\pi$. How much return do I expect starting from state $s$? Action Value Function (Q): $Q^{\pi}(s, a) = E_{\pi}[R_t \mid s_t = s, a_t = a] = E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$, the expected return of starting at state $s$, taking action $a$, and then following policy $\pi$. How much return do I expect starting from state $s$ and taking action $a$?

MDP (Solving MDP: Planning) Again, our goal is to find the optimal policy $\pi^*(s) = \arg\max_{\pi} R^{\pi}(s)$. If $T(s' \mid s, a)$ and $R(s, a, s')$ are known, this is a planning problem, and we can use dynamic programming to find the optimal policy. Keywords: Bellman equation, value iteration, policy iteration.

MDP (Solving MDP: Planning) Bellman Equation: $\forall s \in S: V(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$. Value Iteration: $\forall s \in S: V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_i(s') \right]$. Policy Iteration: Policy Evaluation: $\forall s \in S: V^{\pi_k}_{i+1}(s) \leftarrow \sum_{s'} T(s, \pi_k(s), s') \left[ R(s, \pi_k(s), s') + \gamma V^{\pi_k}_i(s') \right]$; Policy Improvement: $\pi_{k+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]$.
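To make the update rule concrete, here is a minimal tabular value-iteration sketch; the tiny two-state transition and reward arrays are made-up assumptions for illustration, not from the slides:

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, n_iters=100):
    """T[s, a, s'] = transition probability, R[s, a, s'] = reward."""
    V = np.zeros(T.shape[0])
    for _ in range(n_iters):
        # Q[s, a] = sum_{s'} T(s, a, s') * (R(s, a, s') + gamma * V(s'))
        Q = np.einsum('ijk,ijk->ij', T, R + gamma * V[None, None, :])
        V = Q.max(axis=1)          # Bellman backup: V(s) = max_a Q(s, a)
    return V, Q.argmax(axis=1)     # greedy policy w.r.t. the final Q

# Toy 2-state, 2-action MDP (assumed): reward 1 for landing in state 1.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.1, 0.9]]])
R = np.broadcast_to(np.array([0.0, 1.0]), T.shape).copy()
V, policy = value_iteration(T, R)
print(V, policy)
```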

MDP (Solving MDP: Reinforcement Learning (Value-based)) If $T(s' \mid s, a)$ and $R(s, a, s')$ are unknown, this is a reinforcement learning problem. The agent needs to interact with the world and gather experience. At each time step, from state $s$, take action $a$ ($a = \pi(s)$ for a deterministic policy, or sampled from the policy if it is stochastic), receive reward $r$, and end up in state $s'$. Value-based: learn an optimal value function from these data.

MDP (Solving MDP: Reinforcement Learning (Value-based)) One way to learn $Q(s, a)$: use the empirical mean return instead of the expected return, i.e., average the sampled returns: $Q(s, a) = \frac{R_1(s, a) + R_2(s, a) + \cdots + R_n(s, a)}{n}$. The policy then chooses the action that maximizes $Q(s, a)$: $\pi(s) = \arg\max_a Q(s, a)$. Using $V(s)$ instead would require the model: $\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$.
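A sketch of this Monte-Carlo-style estimate that averages sampled returns per (state, action) pair; the environment interface (env.reset, env.step, env.actions) and the epsilon-greedy exploration are assumptions for illustration, not part of the slides:

```python
from collections import defaultdict
import random

def mc_q_estimate(env, n_episodes=1000, gamma=0.9, epsilon=0.1):
    """Every-visit Monte Carlo: Q(s, a) = average of sampled returns."""
    returns = defaultdict(list)   # (s, a) -> list of sampled returns
    Q = defaultdict(float)
    for _ in range(n_episodes):
        # Roll out one episode with an epsilon-greedy policy over the current Q.
        episode, s, done = [], env.reset(), False
        while not done:
            actions = env.actions(s)
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda a: Q[(s, a)]))
            s_next, r, done = env.step(s, a)
            episode.append((s, a, r))
            s = s_next
        # Accumulate discounted returns backwards and average them.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    return Q
```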

Hidden Markov Models (HMM) and Support Vector Machine (SVM), Part 3: Support Vector Machine. Professor Joongheon Kim, School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea.

Outline: Main Idea; Hyperplane in n-dimensional Space; Brief Introduction to Optimization for Support Vector Machine (SVM); SVM for Classification.

Main Idea How can we classify the given data? Any of these hyperplanes would be fine. But which is the best?

Main Idea Find a linear decision surface (hyperplane) that can separate the patient classes and has the largest distance (i.e., largest gap, or margin) between borderline patients (i.e., support vectors). (Figure: normal patients vs. cancer patients plotted by Gene X and Gene Y, separated by a gap.)

Main Idea Kernel: if a linear decision surface does not exist, the data is mapped into a higher-dimensional space (feature space) where a separating decision surface is found. The feature space is constructed via a mathematical projection (the kernel trick).

Outline: Main Idea; Hyperplane in n-dimensional Space; Brief Introduction to Optimization for Support Vector Machine (SVM); SVM for Classification.

Hyperplane in n-dimensional Space [Definition (Hyperplane)] A subspace of one dimension less than its ambient space; i.e., a hyperplane in n-dimensional space is an (n-1)-dimensional subspace.

Hyperplane in n-dimensional Space Equations of a Hyperplane: an equation of a hyperplane is defined by a point ($P_0$) and a vector perpendicular to the plane ($w$) at that point. Define vectors $x_0$ and $x$ (the position vectors of $P_0$ and of an arbitrary point $P$ on the hyperplane). A condition for $P$ to be on the plane is that the vector $x - x_0$ is perpendicular to $w$: $w \cdot (x - x_0) = 0$, i.e., $w \cdot x - w \cdot x_0 = 0$; defining $b = -w \cdot x_0$, this becomes $w \cdot x + b = 0$. The above equations also hold for $\mathbb{R}^n$ when $n > 3$.

Hyperplane in n-dimensional Space Equations of a Hyperplane: let $x_2 = x_1 + t w$, so the distance is $D = \|t w\| = |t| \, \|w\|$. From $w \cdot x_2 + b_2 = 0$: $w \cdot (x_1 + t w) + b_2 = 0 \Rightarrow w \cdot x_1 + t \|w\|^2 + b_2 = 0 \Rightarrow (w \cdot x_1 + b_1) - b_1 + t \|w\|^2 + b_2 = 0 \Rightarrow -b_1 + t \|w\|^2 + b_2 = 0 \Rightarrow t = (b_1 - b_2)/\|w\|^2$. Therefore, $D = |t| \, \|w\| = |b_1 - b_2|/\|w\|$. The distance between two parallel hyperplanes $w \cdot x + b_1 = 0$ and $w \cdot x + b_2 = 0$ is $D = \frac{|b_1 - b_2|}{\|w\|}$.

Outline: Main Idea; Hyperplane in n-dimensional Space; Brief Introduction to Optimization for Support Vector Machine (SVM); SVM for Classification.

Brief Introduction to Optimization for Support Vector Machine Now we understand how to represent data (vectors) and how to define a linear decision surface (hyperplane). We still need to understand how to efficiently compute the hyperplane that separates two classes with the largest gap, which requires the basics of the relevant optimization theory.

Brief Introduction to Optimization for Support Vector Machine Convex Functions: a function is called convex if, for any two points in the interval, the function lies below the straight line segment connecting those two points. Property: any local minimum is a global minimum.

Brief Introduction to Optimization for Support Vector Machine Quadratic Programming (QP): quadratic programming is a special optimization problem in which the function to optimize (the objective) is quadratic, subject to linear constraints. Convex QP problems have convex objective functions. These problems can be solved easily and efficiently by greedy algorithms (because every local minimum is a global minimum).

Brief Introduction to Optimization for Support Vector Machine Quadratic Programming (QP) [Example] Consider $x = (x_1, x_2)$. Minimize $\frac{1}{2} x_2^2$ subject to $x_1 + x_2 - 1 \ge 0$ (quadratic objective, linear constraint). Consider $x = (x_1, x_2)$. Minimize $\frac{1}{2}(x_1^2 + x_2^2)$ subject to $x_1 + x_2 - 1 \ge 0$ (quadratic objective, linear constraint).
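As a quick numerical check, the second toy QP can be solved directly; this sketch uses scipy.optimize.minimize with an inequality constraint (assuming the constraint reads $x_1 + x_2 - 1 \ge 0$):

```python
import numpy as np
from scipy.optimize import minimize

# Minimize 1/2 (x1^2 + x2^2) subject to x1 + x2 - 1 >= 0.
objective = lambda x: 0.5 * (x[0] ** 2 + x[1] ** 2)
constraint = {'type': 'ineq', 'fun': lambda x: x[0] + x[1] - 1}

result = minimize(objective, x0=np.zeros(2), constraints=[constraint])
print(result.x)  # ~[0.5, 0.5]: the minimum lies on the constraint boundary
```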

Outline: Main Idea; Hyperplane in n-dimensional Space; Brief Introduction to Optimization for Support Vector Machine (SVM); SVM for Classification.

SVM for Classification: (Case 1) Linearly Separable Data; Hard-Margin Linear SVM. (Case 2) Not Linearly Separable Data; Soft-Margin Linear SVM. (Case 3) Not Linearly Separable Data; Kernel Trick.

SVM for Classification (Case 1) Linearly Separable Data; Hard-Margin Linear SVM. We want to find a classifier (hyperplane) that separates the negative instances from the positive ones. An infinite number of such hyperplanes exist. The SVM finds the hyperplane that maximizes the gap between the data points on the boundaries (the so-called support vectors). If the points on the boundaries are not informative (e.g., due to noise), SVMs will not do well.

SVM for Classification (Case 1) Linearly Separable Data; Hard-Margin Linear SVM. The gap is the distance between the two parallel hyperplanes $w \cdot x + b = -1$ and $w \cdot x + b = +1$. Since $D = \frac{|b_1 - b_2|}{\|w\|}$, we have $D = \frac{2}{\|w\|}$. To maximize the gap, we have to minimize $\|w\|$, or equivalently, minimize $\frac{1}{2}\|w\|^2$.

SVM for Classification (Case 1) Linearly Separable Data; Hard-Margin Linear SVM. In addition, we need to impose the constraint that all instances are correctly classified. In our case, $w \cdot x_i + b \le -1$ if $y_i = -1$ and $w \cdot x_i + b \ge +1$ if $y_i = +1$; equivalently, $y_i (w \cdot x_i + b) \ge 1$. In summary: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for $i = 1, \ldots, N$.
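This hard-margin QP can be written almost verbatim with a convex-optimization modeling tool; here is a minimal sketch using CVXPY on a tiny toy dataset (the data points are made up for illustration):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b = cp.Variable(2), cp.Variable()

# Hard-margin SVM: minimize 1/2 ||w||^2 subject to y_i (w . x_i + b) >= 1.
constraints = [cp.multiply(y, X @ w + b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()
print(w.value, b.value)
```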

SVM for Classification (Case 2) Not Linearly Separable Data; Soft-Margin Linear SVM. What if the data is not linearly separable? E.g., there are outliers or noisy measurements, or the data is slightly non-linear. Approach: assign a slack variable $\xi_i \ge 0$ to each instance, which can be thought of as the distance from the separating hyperplane if the instance is misclassified, and 0 otherwise. Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ for $i = 1, \ldots, N$.
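The soft-margin variant only adds the slack variables and the penalty term; a sketch continuing the CVXPY formulation above (the value of C and the data are again illustrative):

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [1.5, 1.5]])  # last point is noisy
y = np.array([1.0, 1.0, -1.0, -1.0])
C, N = 1.0, len(y)

w, b = cp.Variable(2), cp.Variable()
xi = cp.Variable(N, nonneg=True)   # slack variables, one per instance

# Soft-margin SVM: minimize 1/2 ||w||^2 + C * sum(xi)
# subject to y_i (w . x_i + b) >= 1 - xi_i, xi_i >= 0.
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints)
prob.solve()
print(w.value, b.value, xi.value)
```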

SVM for Classification (Case 3) Not Linearly Separable Data; Kernel Trick. Data that is not linearly separable in the input space can become linearly separable in the feature space obtained by a kernel.
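In practice the kernel trick is usually applied through a library; a brief sketch (assuming scikit-learn is available) that fits an RBF-kernel SVM to toy data that is not linearly separable in the input space:

```python
import numpy as np
from sklearn.svm import SVC

# Toy "ring" data: inner disk is class +1, outer ring is class -1
# (not linearly separable in the 2D input space).
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.ones(100), -np.ones(100)])

# The RBF kernel implicitly maps the data into a feature space where it is separable.
clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
print(clf.score(X, y))  # close to 1.0 on this toy data
```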

Questions?