Classification: Probabilistic Generative Model


1 Classification: Probabilistic Generative Model

2 Classification: a function maps an input x to a class n. Credit scoring. Input: income, savings, profession, age, past financial history; output: accept or refuse. Medical diagnosis. Input: current symptoms, age, gender, past medical history; output: which kind of disease. Handwritten character recognition. Input: an image of a character; output: the character (e.g. 金). Face recognition. Input: an image of a face; output: the person.

3 Example Application (figure: three examples of the form f(input) = class; the images did not survive transcription)

4 Example Application: the Pokémon games (NOT the Pokémon cards or Pokémon GO). Each Pokémon is described by its stats:
- HP: hit points, or health; how much damage a Pokémon can withstand before fainting (e.g. 35)
- Attack: the base modifier for normal attacks such as Scratch or Punch (e.g. 55)
- Defense: the base damage resistance against normal attacks
- SP Atk: special attack, the base modifier for special attacks such as Fire Blast or Bubble Beam (e.g. 50)
- SP Def: the base damage resistance against special attacks
- Speed: determines which Pokémon attacks first each round
Can we predict the type of a Pokémon from this information?

5 Example Application

6 Ideal Alternatives. Function (model): if $g(x) > 0$, then $f(x)$ outputs class 1; else it outputs class 2. Loss function: $L(f) = \sum_n \delta(f(x^n) \neq \hat{y}^n)$, the number of times f gets incorrect results on the training data. Find the best function: e.g. perceptron, SVM (not today).

7 How to do Classification. Training data for classification: $(x^1, \hat{y}^1), (x^2, \hat{y}^2), \dots, (x^N, \hat{y}^N)$. Classification as regression? Take binary classification as an example. Training: class 1 means the target is +1; class 2 means the target is -1. Testing: an output closer to +1 means class 1; an output closer to -1 means class 2.

8 (figure: two plots of the boundary $b + w_1 x_1 + w_2 x_2 = 0$ in the $(x_1, x_2)$ plane; class 1 examples with regression output $y = b + w_1 x_1 + w_2 x_2 \gg 1$ pull the boundary toward them to decrease the squared error) Regression penalizes examples that are "too correct" (Bishop, p. 186). Multiple classes: letting class 1 mean target 1, class 2 mean target 2, and class 3 mean target 3 is also problematic, since it imposes an ordering and spacing on classes that may have no such relationship.

9 Two Boxes. Box 1: $P(B_1) = 2/3$, $P(\text{Blue}|B_1) = 4/5$, $P(\text{Green}|B_1) = 1/5$. Box 2: $P(B_2) = 1/3$, $P(\text{Blue}|B_2) = 2/5$, $P(\text{Green}|B_2) = 3/5$. A ball is drawn from one of the boxes. Where does it come from? $P(B_1|\text{Blue}) = \dfrac{P(\text{Blue}|B_1)\,P(B_1)}{P(\text{Blue}|B_1)\,P(B_1) + P(\text{Blue}|B_2)\,P(B_2)}$
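
Plugging the slide's numbers into Bayes' rule gives $P(B_1|\text{Blue}) = (4/5 \cdot 2/3) / (4/5 \cdot 2/3 + 2/5 \cdot 1/3) = 0.8$. A minimal sketch in Python (the variable names are ours):

```python
# Bayes' rule for the two-box example on this slide.
p_b1, p_b2 = 2/3, 1/3            # priors: which box the ball came from
p_blue_b1, p_blue_b2 = 4/5, 2/5  # P(Blue | B1), P(Blue | B2)

# P(B1 | Blue) = P(Blue|B1) P(B1) / [P(Blue|B1) P(B1) + P(Blue|B2) P(B2)]
p_b1_blue = (p_blue_b1 * p_b1) / (p_blue_b1 * p_b1 + p_blue_b2 * p_b2)
print(p_b1_blue)  # 0.8: a blue ball most likely came from Box 1
```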

10 Two Classes. Estimate the probabilities from the training data: $P(C_1)$, $P(C_2)$, $P(x|C_1)$, $P(x|C_2)$. Given an x, which class does it belong to? $P(C_1|x) = \dfrac{P(x|C_1)\,P(C_1)}{P(x|C_1)\,P(C_1) + P(x|C_2)\,P(C_2)}$. This is a generative model: with it we can generate an x, since $P(x) = P(x|C_1)\,P(C_1) + P(x|C_2)\,P(C_2)$.

11 Prior: $P(C_1)$ for class 1 (Water) and $P(C_2)$ for class 2 (Normal). Water- and Normal-type Pokémon with ID < 400 are used for training, the rest for testing. Training set: 79 Water, 61 Normal. $P(C_1) = 79/(79+61) = 0.56$; $P(C_2) = 61/(79+61) = 0.44$.

12 Probability from Class: $P(x|C_1) = ?$, i.e. $P(x|\text{Water}) = ?$ Each Pokémon is represented as a feature vector of its attributes. There are 79 Water-type Pokémon in total.

13 Probability from Class - Feature. Considering only Defense and SP Defense: the Water-type examples $x^1, x^2, x^3, \dots, x^{79}$. Assume the points are sampled from a Gaussian distribution. Then what is $P(x|\text{Water})$ for a new x that never appeared in the training data? Is it 0?

14 Gaussian Distribution: $f_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$. Input: vector x; output: the probability density of sampling x. The shape of the function is determined by the mean $\mu$ and the covariance matrix $\Sigma$. (figure: two Gaussians with the same covariance matrix but different means, $\mu = [0, 0]^T$ vs. $\mu = [2, 3]^T$, plotted over $(x_1, x_2)$)

15 Gaussian Distribution: $f_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$. Input: vector x; output: the probability density of sampling x. The shape of the function is determined by the mean $\mu$ and the covariance matrix $\Sigma$. (figure: two Gaussians with the same mean $\mu = [0, 0]^T$ but different covariance matrices, plotted over $(x_1, x_2)$)
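
The density above can be implemented directly with numpy. The helper below is a minimal sketch of the slide's formula (the name `gaussian_pdf` is our own; later examples reuse it):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density f_{mu,Sigma}(x) from the slide."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# Sanity check: a standard 2-D Gaussian at the origin has density 1/(2*pi) ~ 0.159.
print(gaussian_pdf(np.array([0.0, 0.0]), np.zeros(2), np.eye(2)))
```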

16 Probability from Class. Assume the points are sampled from a Gaussian distribution. Find the Gaussian distribution behind them: from the Water-type examples $x^1, x^2, x^3, \dots, x^{79}$, estimate $\mu$ and $\Sigma$; then the probability for a new x is $f_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$. How to find $\mu$ and $\Sigma$?

17 Maximum Likelihood. A Gaussian with any mean $\mu$ and covariance matrix $\Sigma$ could generate these points, but with different likelihood. The likelihood of a Gaussian with mean $\mu$ and covariance matrix $\Sigma$ is the probability that this Gaussian samples $x^1, x^2, x^3, \dots, x^{79}$: $L(\mu, \Sigma) = f_{\mu,\Sigma}(x^1)\, f_{\mu,\Sigma}(x^2)\, f_{\mu,\Sigma}(x^3) \cdots f_{\mu,\Sigma}(x^{79})$.

18 Maximum Likelihood. We have the Water-type Pokémon $x^1, x^2, x^3, \dots, x^{79}$ and assume they are generated from the Gaussian $(\mu^*, \Sigma^*)$ with the maximum likelihood: $\mu^*, \Sigma^* = \arg\max_{\mu,\Sigma} L(\mu, \Sigma)$, where $L(\mu, \Sigma) = f_{\mu,\Sigma}(x^1)\, f_{\mu,\Sigma}(x^2) \cdots f_{\mu,\Sigma}(x^{79})$. The solution is $\mu^* = \frac{1}{79}\sum_{n=1}^{79} x^n$ (the average) and $\Sigma^* = \frac{1}{79}\sum_{n=1}^{79} (x^n - \mu^*)(x^n - \mu^*)^T$.
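
In code, the closed-form estimates are two lines of numpy. A sketch, assuming `X` is a 79×2 array holding the (Defense, SP Defense) vectors of the Water-type Pokémon:

```python
import numpy as np

# X: (79, 2) array, one row per Water-type Pokemon (assumed loaded elsewhere).
mu_star = X.mean(axis=0)                     # mu* = (1/79) sum_n x^n
centered = X - mu_star
sigma_star = centered.T @ centered / len(X)  # Sigma* = (1/79) sum_n (x^n - mu*)(x^n - mu*)^T
```

Note that the maximum-likelihood estimate divides by N, while `np.cov` divides by N-1 by default; `np.cov(X.T, bias=True)` would match the formula above.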

19 Maximum Likelihood. Class 1 (Water): $\mu^1$, $\Sigma^1$; Class 2 (Normal): $\mu^2$, $\Sigma^2$, each estimated from its own training examples. (figure: the numeric estimates, not recoverable from the transcript)

20 Now we can do classification. $f_{\mu^1,\Sigma^1}(x) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma^1|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu^1)^T (\Sigma^1)^{-1} (x-\mu^1)\right)$ with $P(C_1) = 79/(79+61) = 0.56$, and $f_{\mu^2,\Sigma^2}(x) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma^2|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu^2)^T (\Sigma^2)^{-1} (x-\mu^2)\right)$ with $P(C_2) = 61/(79+61) = 0.44$. Compute $P(C_1|x) = \dfrac{P(x|C_1)\,P(C_1)}{P(x|C_1)\,P(C_1) + P(x|C_2)\,P(C_2)}$; if $P(C_1|x) > 0.5$, x belongs to class 1 (Water).
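
Putting the pieces together, a sketch of the resulting classifier (reusing the `gaussian_pdf` helper from above; the class parameters are assumed to have been estimated as on the previous slides):

```python
def posterior_c1(x, mu1, sigma1, mu2, sigma2, p_c1=0.56, p_c2=0.44):
    """P(C1 | x) by Bayes' rule with the two class-conditional Gaussians."""
    a = gaussian_pdf(x, mu1, sigma1) * p_c1   # P(x|C1) P(C1)
    b = gaussian_pdf(x, mu2, sigma2) * p_c2   # P(x|C2) P(C2)
    return a / (a + b)

def classify(x, mu1, sigma1, mu2, sigma2):
    return "Water" if posterior_c1(x, mu1, sigma1, mu2, sigma2) > 0.5 else "Normal"
```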

21 (figure: the $(x_1, x_2)$ plane split into regions where $P(C_1|x) > 0.5$ and $P(C_1|x) < 0.5$; blue points: $C_1$ (Water), red points: $C_2$ (Normal)) How are the results? On the testing data: 47% accuracy. Using all six features (HP, Attack, SP Atk, Defense, SP Def, Speed), so that $\mu^1, \mu^2$ are 6-dimensional vectors and $\Sigma^1, \Sigma^2$ are 6×6 matrices: 64% accuracy.
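
Accuracy is then just the fraction of correct decisions on held-out data; a sketch assuming `X_test` and string labels `y_test`, with the parameters from the previous slides:

```python
# X_test, y_test: held-out features and "Water"/"Normal" labels (assumed given).
# mu1, sigma1, mu2, sigma2: the per-class estimates from slide 19 (assumed computed).
preds = [classify(x, mu1, sigma1, mu2, sigma2) for x in X_test]
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
print(accuracy)  # the slide reports ~0.47 with 2 features and ~0.64 with all 6
```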

22 Modifying Model. Class 1 (Water): $\mu^1$; Class 2 (Normal): $\mu^2$; but now the same $\Sigma$ for both classes, which means fewer parameters.

23 Modifying Model (ref: Bishop, Chapter 4.2). Maximum likelihood with a shared covariance. Water-type Pokémon: $x^1, x^2, x^3, \dots, x^{79}$; Normal-type Pokémon: $x^{80}, x^{81}, x^{82}, \dots, x^{140}$. Find $\mu^1$, $\mu^2$, $\Sigma$ maximizing the likelihood $L(\mu^1, \mu^2, \Sigma) = f_{\mu^1,\Sigma}(x^1)\, f_{\mu^1,\Sigma}(x^2) \cdots f_{\mu^1,\Sigma}(x^{79}) \cdot f_{\mu^2,\Sigma}(x^{80})\, f_{\mu^2,\Sigma}(x^{81}) \cdots f_{\mu^2,\Sigma}(x^{140})$. The solutions for $\mu^1$ and $\mu^2$ are the same as before, and $\Sigma = \frac{79}{140}\Sigma^1 + \frac{61}{140}\Sigma^2$.
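
A sketch of the shared-covariance estimate, assuming `X_water` (79×D) and `X_normal` (61×D) arrays: the per-class means and covariances are computed as before, and the shared $\Sigma$ is their count-weighted average.

```python
import numpy as np

n1, n2 = len(X_water), len(X_normal)                    # 79 and 61 in the slides
mu1, mu2 = X_water.mean(axis=0), X_normal.mean(axis=0)  # same class means as before
sigma1 = (X_water - mu1).T @ (X_water - mu1) / n1
sigma2 = (X_normal - mu2).T @ (X_normal - mu2) / n2
sigma_shared = (n1 * sigma1 + n2 * sigma2) / (n1 + n2)  # (79/140) Sigma^1 + (61/140) Sigma^2
```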

24 Modifying Model. With the same covariance matrix, the decision boundary is linear. With the two features (Defense and SP Defense): 54% accuracy; with all six features (HP, Attack, SP Atk, Defense, SP Def, Speed): 73% accuracy.

25 Three Steps. Function set (model): $P(C_1|x) = \dfrac{P(x|C_1)\,P(C_1)}{P(x|C_1)\,P(C_1) + P(x|C_2)\,P(C_2)}$; if $P(C_1|x) > 0.5$, output class 1, otherwise output class 2. Goodness of a function: the mean $\mu$ and covariance $\Sigma$ that maximize the likelihood (the probability of generating the data). Find the best function: easy (closed form).

26 Probability Distribution. You can always use the distribution you like. For example, factorize $P(x|C_1) = P(x_1|C_1)\, P(x_2|C_1) \cdots P(x_K|C_1)$, with each factor a 1-D Gaussian. For binary features, you may assume they come from Bernoulli distributions. If you assume all the dimensions are independent, then you are using a Naive Bayes classifier.
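
Under the independence assumption, each class-conditional is a product of 1-D Gaussians, which is easiest to evaluate in log space. A minimal sketch (the helper name and parameterization are ours):

```python
import numpy as np

def naive_bayes_log_likelihood(x, mus, variances):
    """log P(x|C) = sum_k log N(x_k; mu_k, var_k), assuming independent dimensions."""
    return np.sum(-0.5 * np.log(2 * np.pi * variances)
                  - (x - mus) ** 2 / (2 * variances))

# Per-class parameters are just per-dimension means and variances, e.g.:
#   mus = X_class.mean(axis=0); variances = X_class.var(axis=0)
```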

27 Posterior Probability. $P(C_1|x) = \dfrac{P(x|C_1)\,P(C_1)}{P(x|C_1)\,P(C_1) + P(x|C_2)\,P(C_2)} = \dfrac{1}{1 + \dfrac{P(x|C_2)\,P(C_2)}{P(x|C_1)\,P(C_1)}} = \dfrac{1}{1 + \exp(-z)} = \sigma(z)$, where $z = \ln\dfrac{P(x|C_1)\,P(C_1)}{P(x|C_2)\,P(C_2)}$ and $\sigma$ is the sigmoid function.

28 Warning of Math

29 Posterior Probability. $P(C_1|x) = \sigma(z)$ with $z = \ln\dfrac{P(x|C_1)\,P(C_1)}{P(x|C_2)\,P(C_2)} = \ln\dfrac{P(x|C_1)}{P(x|C_2)} + \ln\dfrac{P(C_1)}{P(C_2)}$. With $P(C_1) = \dfrac{N_1}{N_1+N_2}$ and $P(C_2) = \dfrac{N_2}{N_1+N_2}$, the prior term is $\ln\dfrac{N_1}{N_2}$. The class-conditionals are the Gaussians $P(x|C_1) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma^1|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu^1)^T (\Sigma^1)^{-1} (x-\mu^1)\right)$ and $P(x|C_2) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma^2|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu^2)^T (\Sigma^2)^{-1} (x-\mu^2)\right)$.

30 Taking the ratio of the two Gaussians, the $(2\pi)^{D/2}$ factors cancel: $\ln\dfrac{P(x|C_1)}{P(x|C_2)} = \ln\dfrac{|\Sigma^2|^{1/2}}{|\Sigma^1|^{1/2}} - \frac{1}{2}\left[(x-\mu^1)^T (\Sigma^1)^{-1} (x-\mu^1) - (x-\mu^2)^T (\Sigma^2)^{-1} (x-\mu^2)\right]$.

31 Expanding the quadratic forms: $(x-\mu^1)^T (\Sigma^1)^{-1} (x-\mu^1) = x^T (\Sigma^1)^{-1} x - 2(\mu^1)^T (\Sigma^1)^{-1} x + (\mu^1)^T (\Sigma^1)^{-1} \mu^1$ and $(x-\mu^2)^T (\Sigma^2)^{-1} (x-\mu^2) = x^T (\Sigma^2)^{-1} x - 2(\mu^2)^T (\Sigma^2)^{-1} x + (\mu^2)^T (\Sigma^2)^{-1} \mu^2$. Therefore $z = \ln\dfrac{|\Sigma^2|^{1/2}}{|\Sigma^1|^{1/2}} - \frac{1}{2} x^T (\Sigma^1)^{-1} x + (\mu^1)^T (\Sigma^1)^{-1} x - \frac{1}{2} (\mu^1)^T (\Sigma^1)^{-1} \mu^1 + \frac{1}{2} x^T (\Sigma^2)^{-1} x - (\mu^2)^T (\Sigma^2)^{-1} x + \frac{1}{2} (\mu^2)^T (\Sigma^2)^{-1} \mu^2 + \ln\dfrac{N_1}{N_2}$.

32 End of Warning

33 $P(C_1|x) = \sigma(z)$ with $z$ as above. Setting $\Sigma^1 = \Sigma^2 = \Sigma$, the quadratic terms and the log-determinant term cancel, leaving $z = (\mu^1 - \mu^2)^T \Sigma^{-1} x - \frac{1}{2}(\mu^1)^T \Sigma^{-1} \mu^1 + \frac{1}{2}(\mu^2)^T \Sigma^{-1} \mu^2 + \ln\dfrac{N_1}{N_2} = w^T x + b$. Hence $P(C_1|x) = \sigma(w \cdot x + b)$. How about directly finding w and b? In the generative model, we estimate $N_1$, $N_2$, $\mu^1$, $\mu^2$, $\Sigma$, and then we have w and b.
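
A sketch of that last step: computing w and b from the generative estimates. Since $\Sigma^{-1}$ is symmetric, $(\mu^1-\mu^2)^T \Sigma^{-1} x = w^T x$ with $w = \Sigma^{-1}(\mu^1 - \mu^2)$. The function name below is ours:

```python
import numpy as np

def generative_to_linear(mu1, mu2, sigma, n1, n2):
    """w, b such that P(C1|x) = sigmoid(w @ x + b) for the shared-Sigma model."""
    inv = np.linalg.inv(sigma)
    w = inv @ (mu1 - mu2)
    b = -0.5 * mu1 @ inv @ mu1 + 0.5 * mu2 @ inv @ mu2 + np.log(n1 / n2)
    return w, b

# P(C1|x) can then be evaluated as 1 / (1 + np.exp(-(w @ x + b))).
```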

34 Reference. Bishop, Pattern Recognition and Machine Learning: Chapter 4. Data: Kaggle Pokémon stats dataset (link lost in transcription). Useful posts (URLs truncated in the source): kemon/pokemon-speed-attack-hp-defense-analysis-bytype; astering-pokebars/discussion; /visualizing-pok-mon-stats-with-seaborn/discussion.

35 Acknowledgment. Thanks to 江貫榮 for spotting a date error on the course web page, to 范廷瀚 for providing Pokémon domain knowledge, and to Victor Chen for spotting typos on the slides.
