CSCI 360 Introduction to Artificial Intelligence Week 2: Problem Solving and Optimization


1 CSCI 360 Introduction to Artificial Intelligence Week 2: Problem Solving and Optimization Professor Wei-Min Shen Week 13.1 and 13.2

2 Status Check Extra credits? Announcement: the evaluation process will start soon, please complete your feedback! Questions, Comments, Suggestions?

3 Today's Lecture Unsupervised Learning Clustering Naïve Bayes Models (e.g., AUTOCLASS) K-Means Clustering Algorithm The EM algorithm The Crown Jewel of Machine Learning!!! I am so proud that you are learning this :-)

4 What Do You Learn in This Class? Learn AI and ML Learn how to learn by yourself EM Algorithm Machine Learning Probability Reasoning General AI Knowledge Representation

5 Unsupervised Learning (Clustering) Type I: Clustering the data Automatically group the data into clusters Type II: Parameter Learning (states are known) Learn transitions, sensor models, & current state Type III: Structural Learning (next lecture) States are unknown and must be automatically determined States are not just symbols but may have internal structure

6 Clustering Clustering systems: Unsupervised learning Detect patterns in unlabeled data E.g. group emails or search results E.g. find categories of customers E.g. detect anomalous executions Useful when you don't know what you're looking for Requires data, but no labels Often get gibberish, but may have surprisingly good results Mom, dad, strangers?

7 Clustering Basic idea: group together similar instances Example: 2D point patterns What could "similar" mean? One option: small (squared) Euclidean distance

8 An Example of Clustering (the four data values are shown on the slide) Given the four data points D above, cluster them into one, two, three, or four clusters? Let the hypotheses be H_1, H_2, H_3, H_4 Choose the H_J such that P(H_J | D, X) is the highest, where X is the background knowledge Bayes Model: How well does the hypothesis explain the data? How likely is the data? How likely is the hypothesis?
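The three questions above are the pieces of Bayes' theorem applied to the clustering hypotheses (a standard restatement in the slide's notation, where X is the background knowledge):

P(H_J \mid D, X) = \frac{P(D \mid H_J, X)\, P(H_J \mid X)}{P(D \mid X)}

The likelihood P(D | H_J, X) measures how well H_J explains the data, the prior P(H_J | X) measures how likely the hypothesis is, and the evidence P(D | X) measures how likely the data is; we pick the H_J with the highest posterior.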

9 Different Hypotheses for Clustering (Assume Gaussian Distributions) H_1, H_2, H_4 (the data points and the fitted Gaussians for each hypothesis are shown on the slide)

10 Clustering Algorithms Naïve Bayes Model (AUTOCLASS) K-Means Agglomerative

11 A Bayes Model Example 1: data D = freshly caught fish, cluster J = types of fish Example 2: data D = star observations, cluster J = types of stars

12 A Naïve Bayes Model The AUTOCLASS Clustering Algorithm (the naïve-Bayes network structure is shown on the slide)

13 AUTOCLASS Algorithm

14 AUTOCLASS Algorithm (continued)

15 Back to the example: compute P(D | H_3, X) (and likewise for the other hypotheses; the numeric values are shown on the slide). Since P(D | H_2, X) > P(D | H_3, X) > P(D | H_4, X) > P(D | H_1, X), and the priors over the hypotheses are equal, Bayes' theorem tells us that H_2 is the best hypothesis, which matches our intuition perfectly!

16 K-Means Clustering An iterative clustering algorithm Pick K random points as cluster centers (means) Alternate: assign data instances to the closest mean; move each mean to the average of its assigned points Stop when no point's assignment changes
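To make the two alternating steps concrete, here is a minimal Python/NumPy sketch of K-means (an illustration only, not code from the course; the toy data and the choice K=2 are made up):

import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Pick K random points as the initial cluster centers (means).
    means = points[rng.choice(len(points), size=k, replace=False)]
    assignments = np.zeros(len(points), dtype=int)
    for it in range(iters):
        # Phase I: assign each point to its closest mean.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assignments):
            break  # no assignment changed: converged
        assignments = new_assign
        # Phase II: move each mean to the average of its assigned points.
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                means[j] = members.mean(axis=0)
    return means, assignments

# Tiny made-up 2D example.
data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers, labels = kmeans(data, k=2)
print(centers, labels)

The two commented steps inside the loop are the Phase I / Phase II stages discussed on the following slides.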

17 K-Means Example

18 K-Means as Optimization Consider the total distance to the means: φ({x_i}, {a_i}, {c_k}) = Σ_i dist(x_i, c_{a_i}), where the x_i are the points, the a_i the assignments, and the c_k the means Each iteration reduces φ Two stages each iteration: Update assignments: fix means c, change assignments a Update means: fix assignments a, change means c

19 Phase I: Update Assignments For each point, reassign it to the closest mean: this can only decrease the total distance φ!

20 Phase II: Update Means Move each mean to the average of its assigned points: this also can only decrease the total distance (Why? For squared Euclidean distance, the mean is the point that minimizes the total squared distance to the assigned points.)

21 Initialization K-means is non-deterministic Requires initial means It does matter what you pick! What can go wrong? Various schemes for preventing this kind of thing: variance-based split / merge, initialization heuristics

22 K-Means Getting Stuck A local optimum: why doesn't this work out like the earlier example, with the purple taking over half the blue?

23 K-Means Questions Will K-means converge? To a global optimum? Will it always find the true patterns in the data? What if the patterns are very, very clear? Will it find something interesting? Do people ever use it? How many clusters to pick?

24 Agglomerative Clustering Agglomerative clustering: First merge very similar instances Incrementally build larger clusters out of smaller clusters Algorithm: Maintain a set of clusters Initially, each instance is in its own cluster Repeat: pick the two closest clusters and merge them into a new cluster Stop when there's only one cluster left Produces not one clustering, but a family of clusterings represented by a dendrogram

25 Agglomerative Clustering How should we define "closest" for clusters with multiple elements? Many options: Closest pair (single-link clustering) Farthest pair (complete-link clustering) Average of all pairs Ward's method (minimum variance, like k-means) Different choices create different clustering behaviors
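A minimal Python sketch of agglomerative clustering under the closest-pair (single-link) option (an illustration of that particular linkage choice; the points are made up):

import numpy as np

def single_link_agglomerative(points):
    """Naive agglomerative clustering with single-link (closest-pair) distance.
    Returns the sequence of merges, i.e. the dendrogram structure."""
    clusters = [[i] for i in range(len(points))]  # each instance starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Pick the two closest clusters under single-link distance.
        best = (None, None, float("inf"))
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], d))
        # Merge them into a new cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Tiny made-up example.
pts = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [10.0, 0.0]])
for left, right, dist in single_link_agglomerative(pts):
    print(left, "+", right, "at distance", round(dist, 2))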

26 Clustering Application Top-level categories: supervised classification Story groupings: unsupervised clustering

27 Clustering Algorithms (Review) Naïve Bayes Model (AUTOCLASS): select the M_i that has the best P(D | M_i) K-Means: loop until no improvement; assign data to the nearest cluster, then adjust clusters to fit the assignments Agglomerative: always merge the pair of closest clusters

28 The EM Algorithm It is not an algorithm, it is a Framework! A loop of two phases: Estimation and Modification (Maximization) For example, when we do clustering: Phase 1: update assignments (data to clusters) Phase 2: update means (adjust clusters)
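As a concrete (hypothetical) instance of the two phases, here is a minimal Python sketch of EM for clustering one-dimensional data with two unit-variance Gaussians, essentially a "soft" K-means; the data, initialization, and fixed variance are illustrative assumptions, not values from the lecture:

import numpy as np

def em_two_gaussians(x, iters=50):
    """EM for a mixture of two unit-variance 1-D Gaussians (soft K-means)."""
    mu = np.array([x.min(), x.max()])   # crude initial means
    w = np.array([0.5, 0.5])            # mixing weights
    for _ in range(iters):
        # E-step (Estimation): soft-assign each point to each cluster.
        lik = w * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step (Maximization/Modification): update means and weights.
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        w = resp.mean(axis=0)
    return mu, w

data = np.array([0.9, 1.1, 1.0, 4.8, 5.2, 5.0])
print(em_two_gaussians(data))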

29 The EM Algorithm A very general framework Many forms & applications: Clustering with Gaussians Bayesian nets with hidden variables Hidden Markov Models (HMM) Partially Observable Markov Decision Processes (POMDP) (see e.g., ALFE 5.10) Others We describe it by form/application: Clustering, Learning POMDPs

30 HMM/POMDP (A Review) 1. Actions 2. Percepts (observations) 3. States 4. Appearance: states → observations 5. Transitions: (states, actions) → states 6. Current State Three key components: - Sensor model θ = P(z | s) - Action model P(s' | s, a) - Current state π_t(s) (localization)

31 Little Prince Example {P(s_3 | s_0, f) = .51, P(s_2 | s_1, b) = .32, P(s_4 | s_3, t) = .89, ...} {P(rose | s_0) = .76, P(volcano | s_1) = .83, P(nothing | s_3) = .42, ...} π_1(0) = 0.25, π_2(0) = 0.25, π_3(0) = 0.25, π_4(0) = 0.25

32 Learning HMM/POMDP M = (B, Z, S, P, θ, π) Task: Given B, Z, S, and an experience, improve P, θ, π. "Improve" means better match the experience How do we do that? EM: Bayesian again! P(E | M): use M to explain E Use the explanation to improve P(M | E)

33 Equations for Improving based on the explanation Assume A and M are independent: P(A | M, C) = P(A | C) (the full derivation, with the "improving" and "explanation" terms labeled, is shown on the slide)

34 Compute Hidden State Sequence (1/2) Experience consists of both O and A O is the observation sequence and A is the action sequence in the experience M is the model, C is the background information Little Prince Example: for three steps, how many possible I are there?

35 Compute Hidden State Sequence (2/2) O = {z_1, z_2, ..., z_T}, A = {b_1, b_2, ..., b_T}, I = {i_1, i_2, ..., i_T} Among all possible state sequences I, there is one with the maximal probability.

36 Which Explanation is the best? Among all possible sequences of states, the best explanation is the sequence of states that gives the maximal value of P(I | O, A, M, C) Experience: Observations: o_1, o_2, ..., o_T Actions: b_1, b_2, ..., b_T Sensor model: θ_j(z) = P(z | s_j) Action model: P_ij[b] = P(s_j | s_i, b) Explanation: State sequence: i_1, i_2, i_3, ..., i_T
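One standard way to find that maximizing state sequence is a Viterbi-style dynamic program. The Python sketch below is a generic version with action-conditioned transitions; the two-state model, its numbers, and the action name "f" at the bottom are made-up illustrations, not the Little Prince values:

import numpy as np

def best_state_sequence(pi, P, theta, obs, acts):
    """Viterbi-style search for the most probable state sequence.
    pi[i]       : initial state distribution
    P[b][i, j]  : P(s_j | s_i, action b)
    theta[i, z] : P(observation z | state s_i)
    obs, acts   : observation indices z_1..z_T and actions b_1..b_{T-1}
    """
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))       # best score of any path ending in state i at time t
    back = np.zeros((T, S), int)   # backpointers for recovering the best path
    delta[0] = pi * theta[:, obs[0]]
    for t in range(1, T):
        for j in range(S):
            scores = delta[t - 1] * P[acts[t - 1]][:, j] * theta[j, obs[t]]
            back[t, j] = scores.argmax()
            delta[t, j] = scores.max()
    # Trace back the maximizing sequence.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))

# Hypothetical 2-state model, one action "f"; all numbers are made up.
pi = np.array([0.5, 0.5])
P = {"f": np.array([[0.7, 0.3], [0.4, 0.6]])}
theta = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: states, cols: observations
print(best_state_sequence(pi, P, theta, obs=[0, 1, 1], acts=["f", "f"]))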

37 The EM Algorithm E-Step: Estimate P(E | M), the likelihood of the experience E given the model M, using the model M to explain the experience E M-Step: Maximize the parameters of the model M using the knowledge learned from the experience, using the explanation to improve the model E.g., the Baum-Welch Learning Procedure

38 Baum-Welch Learning Procedure Using the explanation of the experience to change the model:

39 Update P, θ, π using α, β, γ, ξ (Smoothing) Use the whole experience to determine the beginning (this updates π) From all visits to s_i, how many go to s_j (this updates P) From all visits to s_i, how many look like z_k (this updates θ) (Smoothing) Use the whole model to determine the next state

40 Little Prince Example Computing α, β, γ, ξ using the experience E = {rose}, forward, {nothing}, forward, {volcano} (the three-step trellis over states s_0..s_3 is shown on the slide) States: s_0, s_1, s_2, s_3 Initial state distribution: 0.25, 0.25, 0.25, 0.25 Transitions: P(S_j | S_i, forward) (table on the slide) Observations: Z = <rose, volcano, nothing> Sensor model: (table on the slide)

41 Forward Procedure Compute step by step: α_t(i) is the probability of being in state s_i at time t together with the observations seen so far
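The recursion behind the forward procedure, written with this lecture's action-conditioned transition model P_ij[b] and sensor model θ_i(z) = P(z | s_i) (a reconstruction of the standard formula, consistent with the worked numbers on the next slides):

\alpha_1(i) = \pi(i)\,\theta_i(z_1),
\qquad
\alpha_{t+1}(j) = \Big[\sum_i \alpha_t(i)\, P_{ij}[b_t]\Big]\,\theta_j(z_{t+1})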

42 Compute α values α_1(s_3)=0.25*0.8=0.2, α_1(s_2)=0.25*0.5=0.125, α_1(s_1)=0.25*0.2=0.05, α_1(s_0)=0.25*0.4=0.1; α_2(s_3)=(.2*.1+.125*.1+.05*.1+.1*.4)*0.1, α_2(s_2)=(.2*.5+.125*.1+.05*.3+.1*.2)*0.2, α_2(s_1)=(.2*.3+.125*.3+.05*.3+.1*.1)*0.2, α_2(s_0)=(.2*.1+.125*.5+.05*.3+.1*.3)*0.1

43 Compute α values α_1(s_3)=0.2, α_1(s_2)=0.125, α_1(s_1)=0.05, α_1(s_0)=0.1; α_2(s_3)=0.00775, α_2(s_2)=…, α_2(s_1)=0.092, α_2(s_0)=…; α_3(s_3)=(.00775*… + … + … + …*.4)*0.1, α_3(s_2)=(.00775*… + … + … + …*.2)*0.3, α_3(s_1)=(.00775*… + … + … + …*.1)*0.6, α_3(s_0)=(.00775*… + … + … + …*.3)*0.5 (the remaining factors are on the slide)

44 Compute β_t(i) by the Backward Procedure β_t(i): the probability of the observations after time t, given that the state at time t is s_i
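The corresponding backward recursion (again the standard form with action-conditioned transitions; it matches the worked β values on the next slides):

\beta_T(i) = 1,
\qquad
\beta_t(i) = \sum_j P_{ij}[b_t]\,\theta_j(z_{t+1})\,\beta_{t+1}(j)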

45 Compute β values β_2(s_3)=(.1*.1+.5*.3+.3*.6+.1*.5)*1.0, β_2(s_2)=(.1*.1+.1*.3+.3*.6+.5*.5)*1.0, β_2(s_1)=(.1*.1+.3*.3+.3*.6+.3*.5)*1.0, β_2(s_0)=(.4*.1+.2*.3+.1*.6+.3*.5)*1.0; β_3(s_3)=1.0, β_3(s_2)=1.0, β_3(s_1)=1.0, β_3(s_0)=1.0

46 Compute β values β_1(s_3) = (fill in here in class), β_1(s_2) = , β_1(s_1) = , β_1(s_0) = ; β_2(s_3)=0.39, β_2(s_2)=0.47, β_2(s_1)=0.43, β_2(s_0)=0.31; β_3(s_3)=1.0, β_3(s_2)=1.0, β_3(s_1)=1.0, β_3(s_0)=1.0
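A compact Python sketch of the two procedures together (a generic forward-backward implementation; the tiny two-state model at the bottom is made up for illustration and is not the Little Prince model):

import numpy as np

def forward_backward(pi, P, theta, obs, acts):
    """Compute alpha and beta for an action-conditioned HMM/POMDP model.
    pi[i]       : initial state distribution
    P[b][i, j]  : P(s_j | s_i, action b)
    theta[i, z] : P(observation z | state s_i)
    obs, acts   : observation indices z_1..z_T and actions b_1..b_{T-1}
    """
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))                 # beta_T(i) = 1
    alpha[0] = pi * theta[:, obs[0]]
    for t in range(1, T):                  # forward pass
        alpha[t] = (alpha[t - 1] @ P[acts[t - 1]]) * theta[:, obs[t]]
    for t in range(T - 2, -1, -1):         # backward pass
        beta[t] = P[acts[t]] @ (theta[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

# Hypothetical 2-state, 2-observation model with one action "f"; numbers are made up.
pi = np.array([0.5, 0.5])
P = {"f": np.array([[0.7, 0.3], [0.4, 0.6]])}
theta = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, beta = forward_backward(pi, P, theta, obs=[0, 1, 1], acts=["f", "f"])
print(alpha)
print(beta)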

47 γ_t(i) Value: Putting α and β together γ_t(i) is the probability of being at state s_i at time t given the entire experience E_{1:T} (trellis diagram shown on the slide)
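Combining the two quantities gives the standard expression (the "γ = α*β" on the next slide is the unnormalized numerator):

\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_j \alpha_t(j)\,\beta_t(j)}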

48 Compute γ = α*β: γ_1(s_3), γ_1(s_2), γ_1(s_1), γ_1(s_0); γ_2(s_3), γ_2(s_2), γ_2(s_1), γ_2(s_0); γ_3(s_3), γ_3(s_2), γ_3(s_1), γ_3(s_0) (values filled in on the slide)

49 Putting α and β together (α_t and β_t per state):
s_3: α_1=0.2, β_1=…, α_2=…, β_2=0.39, α_3=…, β_3=1.0
s_2: α_1=0.125, β_1=…, α_2=…, β_2=0.47, α_3=…, β_3=1.0
s_1: α_1=0.05, β_1=…, α_2=0.092, β_2=0.43, α_3=…, β_3=1.0
s_0: α_1=0.1, β_1=…, α_2=…, β_2=0.31, α_3=…, β_3=1.0
(the remaining entries are on the slide or filled in during class)

50 ξ_t(i,j) Value ξ_t(i,j) is the probability of making the transition from s_i to s_j at time t in the experience E_{1:T} (trellis diagram shown on the slide)
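In the same notation, the standard expression for ξ (the numerator's four factors are the ones highlighted in the example on the next slide):

\xi_t(i,j) = \frac{\alpha_t(i)\,P_{ij}[b_t]\,\theta_j(z_{t+1})\,\beta_{t+1}(j)}
                  {\sum_k \sum_l \alpha_t(k)\,P_{kl}[b_t]\,\theta_l(z_{t+1})\,\beta_{t+1}(l)}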

51 Example ξ value The slide's trellis highlights the factors α_1(s_3)=0.2, P(s_2 | s_3)=0.3, θ_{s_2}(nothing)=0.2, and β_2(s_2)=0.47, which multiply in the numerator of ξ_1(s_3, s_2), alongside the other values α_1(s_2)=0.125, α_1(s_1)=0.05, α_1(s_0)=0.1, β_2(s_3)=0.39, β_2(s_1)=0.43, β_2(s_0)=0.31

52 Improve the model by the explanation Update P, θ, π using α, β, γ, ξ Use the whole experience to determine the beginning (π) From all visits to s_i in E, how many go to s_j (P) From all visits to s_i in E, how many appear as z_k (θ) Use the whole experience to determine the distribution for the next state
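These counting arguments correspond to the standard Baum-Welch re-estimation formulas, written here with the lecture's action-conditioned notation (a reconstruction of the usual update rules, not copied from the slides):

\bar{\pi}(i) = \gamma_1(i),
\qquad
\bar{P}_{ij}[b] = \frac{\sum_{t\,:\,b_t = b} \xi_t(i,j)}{\sum_{t\,:\,b_t = b} \gamma_t(i)},
\qquad
\bar{\theta}_j(k) = \frac{\sum_{t\,:\,z_t = k} \gamma_t(j)}{\sum_t \gamma_t(j)}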

53 The General EM Algorithm E-Step: Estimate P(E | M), the likelihood of the experience E given the model M E.g., computing α, β, γ, ξ using the experience K-means: assigning data to the (closest) clusters M-Step: Maximize the parameters of the model M using the knowledge (e.g., explanations) learned from the experience E.g., updating P, θ, π using α, β, γ, ξ K-means: move the clusters based on the assignments

54 Comments on EM The most general and powerful learning method Many existing algorithms are special cases of EM Tremendous application potential Robot navigation, localization, mapping, SLAM, manipulation, planning, etc. Natural language processing (IBM's Watson) Data Mining Games that can improve themselves Discovering patterns from genetic and health data
