Bayesian Networks (Part I)

1 Introduction to Machine Learning. Machine Learning Department, School of Computer Science, Carnegie Mellon University. Bayesian Networks (Part I). Graphical Model Readings: Murphy; Bishop 8.1; HTF --; Mitchell 6.11. Matt Gormley, Lecture 22, April 10

2 Reminders. Peer Tutoring. Homework 7: Deep Learning. Release: Wed, Apr. 05. Part I due Wed, Apr. 12 (start early). Part II due Mon, Apr. 17.

3 CONVOLUTIONAL NEURAL NETS

4 Deep Learning Outline. Background: Computer Vision (Image Classification; ILSVRC; Traditional Feature Extraction Methods; Convolution as Feature Extraction). Convolutional Neural Networks (CNNs): Learning Feature Abstractions; Common CNN Layers (Convolutional Layer; Max-Pooling Layer; Fully-connected Layer (w/ tensor input); Softmax Layer; ReLU Layer; Background: Subgradient); Architecture: LeNet; Architecture: AlexNet. Training a CNN: SGD for CNNs; Backpropagation for CNNs.

5 Convolutional Neural Network (CNN). Typical layers include: convolutional layer; max-pooling layer; fully-connected (linear) layer; ReLU layer (or some other nonlinear activation function); softmax. These can be arranged into arbitrarily deep topologies. Architecture #1: LeNet-5.

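6 Convolutional Layer. CNN key idea: treat the convolution matrix as parameters and learn them! [Figure: the input image is convolved with a learned 3x3 convolution matrix with entries $\theta_{11}, \theta_{12}, \theta_{13}; \theta_{21}, \theta_{22}, \theta_{23}; \theta_{31}, \theta_{32}, \theta_{33}$ to produce the convolved image.]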

7 Downsampling by Averaging. Downsampling by averaging used to be a common approach. This is a special case of convolution where the weights are fixed to a uniform distribution. The example below uses a stride of 2. [Figure: the input image is convolved with a filter whose four weights are each 1/4, producing the convolved image.]
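A minimal sketch of downsampling by averaging, assuming a 2x2 window and a stride of 2. The function name `downsample_average` and the 4x4 image are hypothetical and are not the values shown on the slide.

```python
def downsample_average(image, size=2, stride=2):
    """Convolve with a size x size filter whose weights are all 1/size^2."""
    weight = 1.0 / (size * size)  # uniform weights, e.g. 1/4 for a 2x2 window
    rows = range(0, len(image) - size + 1, stride)
    cols = range(0, len(image[0]) - size + 1, stride)
    return [
        [
            sum(weight * image[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in cols
        ]
        for i in rows
    ]


# Hypothetical 4x4 binary image (an assumption, not the slide's image).
image = [
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
pooled = downsample_average(image)
print(pooled)
```

With a stride of 2 the windows do not overlap, so a 4x4 input yields a 2x2 output.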

8 Max-Pooling. Max-pooling is another (common) form of downsampling. Instead of averaging, we take the max value within the same range as the equivalently-sized convolution. The example below uses a stride of 2. [Figure: the input image is max-pooled over $x_{i,j}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1}$ to produce the max-pooled image.]
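As a rough illustration, max-pooling with a 2x2 window and a stride of 2 might look like the sketch below. The function name `max_pool` and the example image are assumptions made for illustration only.

```python
def max_pool(image, size=2, stride=2):
    """Take the max over each size x size window, moving by `stride`."""
    rows = range(0, len(image) - size + 1, stride)
    cols = range(0, len(image[0]) - size + 1, stride)
    return [
        [
            max(image[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in cols
        ]
        for i in rows
    ]


# Hypothetical 4x4 input (not taken from the slide).
image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool(image)
print(pooled)
```

Unlike averaging, max-pooling has no weights at all, so it contributes no learnable parameters.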

9 Multi-Class Output. [Figure: network diagram with input, hidden layer, and output.]

10 Multi-Class Output. Softmax Layer: $y_k = \frac{\exp(b_k)}{\sum_{l=1}^{K} \exp(b_l)}$
(F) Loss: $J = -\sum_{k=1}^{K} y_k^{*} \log(y_k)$
(E) Output (softmax): $y_k = \frac{\exp(b_k)}{\sum_{l=1}^{K} \exp(b_l)}$
(D) Output (linear): $b_k = \sum_{j=0}^{D} \beta_{kj} z_j, \ \forall k$
(C) Hidden (nonlinear): $z_j = \sigma(a_j), \ \forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \ \forall j$
(A) Input: given $x_i, \ \forall i$
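The forward computation (A) through (F) can be sketched in a few lines. This is an illustration under several assumptions: the nonlinearity is taken to be the logistic sigmoid, the sums starting at index 0 are read as bias terms with $x_0 = 1$ and $z_0 = 1$, the loss is taken to be the cross-entropy with a leading minus sign, and the weight names `alpha` and `beta` and all values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def softmax(b):
    m = max(b)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in b]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, alpha, beta, y_star):
    """Return (y, J) for one example; y_star is a one-hot label vector."""
    x = [1.0] + list(x)                                           # (A), x_0 = 1 (assumed bias)
    a = [sum(w * xi for w, xi in zip(row, x)) for row in alpha]   # (B) hidden, linear
    z = [1.0] + [sigmoid(aj) for aj in a]                         # (C) hidden, nonlinear; z_0 = 1
    b = [sum(w * zj for w, zj in zip(row, z)) for row in beta]    # (D) output, linear
    y = softmax(b)                                                # (E) output, softmax
    J = -sum(t * math.log(p) for t, p in zip(y_star, y) if t > 0) # (F) cross-entropy loss
    return y, J


# Hypothetical sizes: M = 2 inputs, D = 2 hidden units, K = 3 classes.
alpha = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
beta = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
y, J = forward([0.5, -1.0], alpha, beta, y_star=[0, 1, 0])
print(y, J)
```

With all weights at zero, every class receives the same score, so the softmax output is uniform and the loss equals $\log K$.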

11 Whiteboard. Training a CNN: SGD for CNNs; Backpropagation for CNNs.

12 Whiteboard. Common CNN Layers: ReLU Layer; Background: Subgradient; Fully-connected Layer (w/ tensor input); Softmax Layer; Convolutional Layer; Max-Pooling Layer.

13 Convolutional Layer

14 Convolutional Layer

15 Max-Pooling Layer

16 Max-Pooling Layer

18 Architecture #2: AlexNet. CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2012). 15.3% error on the ImageNet LSVRC contest. Input image (pixels); five convolutional layers (w/ max-pooling); three fully connected layers; multi-way softmax.

19 CNNs for Image Recognition. Slide from Kaiming He.

20 Mini-Batch SGD. Gradient Descent: compute the true gradient exactly from all N examples. Mini-Batch SGD: approximate the true gradient by the average gradient of K randomly chosen examples. Stochastic Gradient Descent (SGD): approximate the true gradient by the gradient of one randomly chosen example.
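The three gradient estimates differ only in how many examples they average over, which the sketch below tries to make concrete. It assumes a one-parameter least-squares model and a noiseless toy dataset; every name and number here is hypothetical.

```python
import random


def grad_example(w, x, y):
    """Gradient of 0.5 * (w * x - y)^2 with respect to the scalar weight w."""
    return (w * x - y) * x


def average_gradient(w, data, indices):
    return sum(grad_example(w, *data[i]) for i in indices) / len(indices)


def full_gradient(w, data):
    """Gradient descent: use all N examples."""
    return average_gradient(w, data, range(len(data)))


def minibatch_gradient(w, data, k, rng):
    """Mini-batch SGD: average over K randomly chosen examples."""
    return average_gradient(w, data, rng.sample(range(len(data)), k))


def sgd_gradient(w, data, rng):
    """SGD: the mini-batch estimate with K = 1."""
    return minibatch_gradient(w, data, 1, rng)


def train(data, k, lr=0.1, steps=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        w -= lr * minibatch_gradient(w, data, k, rng)
    return w


# Hypothetical noiseless data generated from y = 3x.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
print(train(data, k=1), train(data, k=2), train(data, k=len(data)))
```

Setting K = N recovers gradient descent and K = 1 recovers SGD, so mini-batch SGD interpolates between the two.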

21 Mini-Batch SGD. Three variants of first-order optimization: gradient descent, mini-batch SGD, and stochastic gradient descent (SGD).

22 CNN VISUALIZATIONS

23 3D Visualization of CNN

24 Convolution of a Color Image. Color images consist of 3 floats per pixel for RGB (red, green, blue) color values. Convolution must also be 3-dimensional. [Figure: a 32x32x3 image is convolved with a 5x5x3 filter, sliding over all spatial locations, to produce a 28x28 activation map. Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N).]
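The shapes in this example can be checked with a small sketch. It assumes no padding and a stride of 1, and, as is common in CNN implementations, it does not flip the filter (strictly a cross-correlation). The function name `conv3d_valid` and the all-ones inputs are hypothetical.

```python
def conv3d_valid(image, kernel):
    """Slide an h x w x C filter over an H x W x C image (no padding, stride 1)."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    kh, kw = len(kernel), len(kernel[0])
    assert len(kernel[0][0]) == C, "filter depth must match image depth"
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    for c in range(C):  # sum across all color channels
                        total += image[i + di][j + dj][c] * kernel[di][dj][c]
            row.append(total)
        out.append(row)
    return out


# Hypothetical all-ones 32x32x3 image and 5x5x3 filter.
image = [[[1.0] * 3 for _ in range(32)] for _ in range(32)]
kernel = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]
activation = conv3d_valid(image, kernel)
print(len(activation), len(activation[0]))
```

Each output value sums over all three channels, which is why one 5x5x3 filter produces a single two-dimensional activation map of side 32 - 5 + 1 = 28.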

25 Animation of 3D Convolution. http://cs231n.github. … networks/ Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N).

26 MNIST Digit Recognition with CNNs (in your browser). Figure from Andrej Karpathy.

27 CNN Summary. CNNs: are used for all aspects of computer vision and have won numerous pattern recognition competitions; are able to learn interpretable features at different levels of abstraction; typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers. Other resources: readings on the course website; Andrej Karpathy, CS231n Notes (… networks/).

28 BAYESIAN NETWORKS

29 Bayes Nets Outline. Motivation: Structured Prediction. Background: Conditional Independence; Chain Rule of Probability. Directed Graphical Models: Writing Joint Distributions; Definition: Bayesian Network (Qualitative Specification; Quantitative Specification); Familiar Models as Bayes Nets. Conditional Independence in Bayes Nets: Three case studies; D-separation; Markov blanket. Learning: Fully Observed Bayes Net; (Partially Observed Bayes Net). Inference: Sampling directly from the joint distribution; Gibbs Sampling.

30 MOTIVATION: STRUCTURED PREDICTION

31 Structured Prediction. Most of the models we've seen so far were for classification: given observations $x = (x_1, x_2, \ldots, x_K)$, predict a (binary) label $y$. Many real-world problems require structured prediction: given observations $x = (x_1, x_2, \ldots, x_K)$, predict a structure $y = (y_1, y_2, \ldots, y_J)$. Some classification problems benefit from latent structure.

32 Structured Prediction Examples. Examples of structured prediction: part-of-speech (POS) tagging; handwriting recognition; speech recognition; word alignment; congressional voting. Examples of latent structure: object recognition.

33 Dataset for Supervised Part-of-Speech (POS) Tagging. Data: $\mathcal{D} = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$
Sample 1: $y^{(1)}$ = n v p d n; $x^{(1)}$ = time flies like an arrow
Sample 2: $y^{(2)}$ = n n v d n; $x^{(2)}$ = time flies like an arrow
Sample 3: $y^{(3)}$ = n v p n n; $x^{(3)}$ = flies fly with their wings
Sample 4: $y^{(4)}$ = p n n v v; $x^{(4)}$ = with time you will see

34 Dataset for Supervised Handwriting Recognition. Data: $\mathcal{D} = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$
Sample 1: $y^{(1)}$ = u n e x p e c t e d; $x^{(1)}$ = handwritten word image
Sample 2: $y^{(2)}$ = v o l c a n i c; $x^{(2)}$ = handwritten word image
Sample 3: $y^{(3)}$ = e m b r a c e s; $x^{(3)}$ = handwritten word image
Fig. 5. Handwriting recognition: Example words from the dataset used. Figures from (Chatzis & Demiris, 2013).

35 Dataset for Supervised Phoneme (Speech) Recognition. Data: $\mathcal{D} = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$
Sample 1: $y^{(1)}$ = h# dh ih s w uh z iy z iy; $x^{(1)}$ = speech signal
Sample 2: $y^{(2)}$ = f ao r ah s s h#; $x^{(2)}$ = speech signal
Figures from (Jansen & Niyogi, 2013).

36 Application: Word Alignment / Phrase Extraction. Variables (boolean): for each (Chinese phrase, English phrase) pair, are they linked? Interactions: word fertilities; few jumps (discontinuities); syntactic reorderings; ITG constraint on alignment; phrases are disjoint (?). (Burkett & Klein, 2012)

37 Application: Congressional Voting. Variables: text of all speeches of a representative; local contexts of references between two representatives. Interactions: words used by a representative and their vote; pairs of representatives and their local context. (Stoyanov & Eisner, 2012)

39 Case Study: Object Recognition. Data consists of images x and labels y. [Figure: four images $x^{(1)}, \ldots, x^{(4)}$ with labels $y^{(1)}$ = pigeon, $y^{(2)}$ = rhinoceros, $y^{(3)}$ = leopard, $y^{(4)}$ = llama.]

40 Case Study: Object Recognition. Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: image labeled "leopard".]

41 Case Study: Object Recognition. Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: image labeled "leopard" with patch variables X2, X7, latent part variables Z2, Z7, and label variable Y.]

42 Case Study: Object Recognition. Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define a graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: the same variables X2, X7, Z2, Z7, Y, now connected by factors ψ1, ψ2, ψ3, ψ4.]

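43 Structured Prediction. Preview of challenges to come: consider the task of finding the most probable assignment to the output. Classification: $\hat{y} = \operatorname{argmax}_{y} p(y \mid x)$ where $y \in \{+1, -1\}$. Structured Prediction: $\hat{\mathbf{y}} = \operatorname{argmax}_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$ where $\mathbf{y} \in \mathcal{Y}$ and $|\mathcal{Y}|$ is very large.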
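To see why a very large output space is a challenge, consider a brute-force argmax over every tag sequence for a five-word sentence. This is only a sketch: the tag set, the per-word scores in `EMISSION`, and the additive `score` function are hypothetical stand-ins for a real model of the output given the input.

```python
from itertools import product

TAGS = ("n", "v", "p", "d")  # hypothetical tag set

# Hypothetical per-word scores; any (word, tag) pair not listed scores 0.
EMISSION = {
    ("time", "n"): 2.0,
    ("flies", "v"): 2.0, ("flies", "n"): 1.5,
    ("like", "p"): 2.0, ("like", "v"): 1.0,
    ("an", "d"): 2.0,
    ("arrow", "n"): 2.0,
}


def score(words, tags):
    """Toy score of a whole tag sequence (an assumption, not a trained model)."""
    return sum(EMISSION.get((w, t), 0.0) for w, t in zip(words, tags))


def brute_force_argmax(words):
    """Enumerate every structure in the output space and keep the best one."""
    candidates = product(TAGS, repeat=len(words))
    return max(candidates, key=lambda tags: score(words, tags))


words = "time flies like an arrow".split()
best = brute_force_argmax(words)
num_structures = len(TAGS) ** len(words)
print(best, num_structures)
```

With only 4 tags and 5 words there are already 4^5 = 1024 candidate structures, and the count grows exponentially with sentence length, so exhaustive enumeration quickly becomes impractical.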

44 Machine Learning. Domain Knowledge: the data inspires the structures we want to predict. Mathematical Modeling: our model defines a score for each structure; it also tells us what to optimize. Combinatorial Optimization: inference finds {best structure, marginals, partition function} for a new observation. Optimization: learning tunes the parameters of the model. (Inference is usually called as a subroutine in learning.)

45 Machine Learning. [Figure: Data (example sentences such as "Alice saw Bob on a hill with a telescope" and "time flies like an arrow"), Model (variables X1 to X5 over "time flies like an arrow"), Objective, Inference, and Learning.] (Inference is usually called as a subroutine in learning.)

46 BACKGROUND

47 Whiteboard. Background: Chain Rule of Probability; Conditional Independence.

48 Background: Chain Rule of Probability. For random variables A and B: $P(A, B) = P(A \mid B)\, P(B)$. For random variables $X_1, X_2, X_3, X_4$: $P(X_1, X_2, X_3, X_4) = P(X_1 \mid X_2, X_3, X_4)\, P(X_2 \mid X_3, X_4)\, P(X_3 \mid X_4)\, P(X_4)$.
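The chain rule holds for any joint distribution, which can be checked numerically. The sketch below uses three binary variables rather than four to stay short; the joint table is built from arbitrary hypothetical weights, and the helper names are assumptions.

```python
from itertools import product

# Hypothetical joint over three binary variables, from arbitrary positive weights.
weights = [3, 1, 2, 2, 1, 4, 2, 1]
total = sum(weights)
joint = {xs: w / total for xs, w in zip(product((0, 1), repeat=3), weights)}


def marginal(fixed):
    """Probability of a partial assignment, given as {position: value}."""
    return sum(p for xs, p in joint.items()
               if all(xs[i] == v for i, v in fixed.items()))


def conditional(target, given):
    """P(target | given) computed as P(target, given) / P(given)."""
    return marginal({**target, **given}) / marginal(given)


def chain_rule_product(x1, x2, x3):
    """P(X1 | X2, X3) * P(X2 | X3) * P(X3) for one full assignment."""
    return (conditional({0: x1}, {1: x2, 2: x3})
            * conditional({1: x2}, {2: x3})
            * marginal({2: x3}))


max_error = max(abs(chain_rule_product(*xs) - joint[xs]) for xs in joint)
print(max_error)
```

No independence assumption is involved here: the product of conditionals reproduces every entry of the joint table up to floating-point error.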

49 Background: Conditional Independence. Random variables A and B are conditionally independent given C if: $P(A, B \mid C) = P(A \mid C)\, P(B \mid C)$ (1), or equivalently: $P(A \mid B, C) = P(A \mid C)$ (2). We write this as: $A \perp B \mid C$ (3). Later we will also write: I<A, {C}, B>.
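Condition (1) can be tested directly on a small joint table. In this sketch the joint is constructed as P(C) P(A | C) P(B | C), so A and B are conditionally independent given C by design; all probabilities and helper names are hypothetical.

```python
from itertools import product

# Hypothetical tables; p_a_given_c[c][a] is P(A = a | C = c), likewise for B.
p_c = {0: 0.6, 1: 0.4}
p_a_given_c = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_b_given_c = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}

joint = {
    (a, b, c): p_c[c] * p_a_given_c[c][a] * p_b_given_c[c][b]
    for a, b, c in product((0, 1), repeat=3)
}

INDEX = {"a": 0, "b": 1, "c": 2}


def prob(**fixed):
    """Marginal probability of a partial assignment such as prob(a=0, c=1)."""
    return sum(p for key, p in joint.items()
               if all(key[INDEX[name]] == v for name, v in fixed.items()))


def is_cond_independent(tol=1e-12):
    """Check P(A, B | C) = P(A | C) P(B | C) for every assignment."""
    for a, b, c in product((0, 1), repeat=3):
        lhs = prob(a=a, b=b, c=c) / prob(c=c)
        rhs = (prob(a=a, c=c) / prob(c=c)) * (prob(b=b, c=c) / prob(c=c))
        if abs(lhs - rhs) > tol:
            return False
    return True


def is_marginally_independent(tol=1e-12):
    """Check P(A, B) = P(A) P(B) for every assignment."""
    return all(abs(prob(a=a, b=b) - prob(a=a) * prob(b=b)) <= tol
               for a, b in product((0, 1), repeat=2))


print(is_cond_independent(), is_marginally_independent())
```

In this example A and B are conditionally independent given C but not marginally independent, which shows that the two notions are distinct.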

50 Bayesian Networks: DIRECTED GRAPHICAL MODELS

51 Example: Tornado Alarms. 1. Imagine that you work at the 911 call center in Dallas. 2. You receive six calls informing you that the Emergency Weather Sirens are going off. 3. What do you conclude? Figure from … emergency-sirens-hacking.html

53 Whiteboard. Directed Graphical Models (Bayes Nets). Example: Tornado Alarms. Writing Joint Distributions: Idea #1: giant table; Idea #2: rewrite using chain rule; Idea #3: assume full independence; Idea #4: drop variables from RHS of conditionals. Definition: Bayesian Network. Observed Variables in Graphical Models.

54 Bayesian Network. [Figure: directed graph over $X_1, X_2, X_3, X_4, X_5$.] $p(X_1, X_2, X_3, X_4, X_5) = p(X_5 \mid X_3)\, p(X_4 \mid X_2, X_3)\, p(X_3)\, p(X_2 \mid X_1)\, p(X_1)$

55 Bayesian Network. Definition: $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))$. [Figure: directed graph over $X_1, X_2, X_3, X_4, X_5$.] A Bayesian Network is a directed graphical model. It consists of a graph G and the conditional probabilities P. These two parts fully specify the distribution: Qualitative Specification: G. Quantitative Specification: P.
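A minimal sketch of the two specifications for the five-variable example: `PARENTS` plays the role of the graph G and `P_ONE` the role of the conditional probabilities P. It assumes binary variables, and every number in the tables is hypothetical.

```python
from itertools import product

# Qualitative specification G: the parents of each variable X_i.
PARENTS = {1: (), 2: (1,), 3: (), 4: (2, 3), 5: (3,)}

# Quantitative specification P: P(X_i = 1 | parents), keyed by parent values.
# All numbers are hypothetical.
P_ONE = {
    1: {(): 0.3},
    2: {(0,): 0.2, (1,): 0.7},
    3: {(): 0.6},
    4: {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    5: {(0,): 0.25, (1,): 0.8},
}


def joint(assignment):
    """Product over i of P(X_i | parents(X_i)); assignment maps i to 0 or 1."""
    p = 1.0
    for i, parents in PARENTS.items():
        p_one = P_ONE[i][tuple(assignment[j] for j in parents)]
        p *= p_one if assignment[i] == 1 else 1.0 - p_one
    return p


total = sum(joint(dict(zip(PARENTS, values)))
            for values in product((0, 1), repeat=5))
num_parameters = sum(len(table) for table in P_ONE.values())
full_table_parameters = 2 ** 5 - 1
print(total, num_parameters, full_table_parameters)
```

The factored form needs 10 numbers for these binary variables, compared with 31 for a full joint table, while still defining a distribution that sums to 1.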
