10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Bayesian Networks (Part I) Graphical Model Readings: Murphy 10-10.2.1; Bishop 8.1, 8.2.2; HTF (none); Mitchell 6.11 Matt Gormley Lecture 22 April 10, 2017 1
Reminders Peer Tutoring Homework 7: Deep Learning Release: Wed, Apr. 05 Part I due Wed, Apr. 12 Start Early Part II due Mon, Apr. 17 2
CONVOLUTIONAL NEURAL NETS 3
Deep Learning Outline Background: Computer Vision Image Classification ILSVRC 2010-2016 Traditional Feature Extraction Methods Convolution as Feature Extraction Convolutional Neural Networks (CNNs) Learning Feature Abstractions Common CNN Layers: Convolutional Layer Max-Pooling Layer Fully-connected Layer (w/ tensor input) Softmax Layer ReLU Layer Background: Subgradient Architecture: LeNet Architecture: AlexNet Training a CNN SGD for CNNs Backpropagation for CNNs 4
Convolutional Neural Network (CNN) Typical layers include: Convolutional layer, Max-pooling layer, Fully-connected (Linear) layer, ReLU layer (or some other nonlinear activation function), Softmax. These can be arranged into arbitrarily deep topologies. Architecture #1: LeNet-5 5
Input Image Convolutional Layer CNN key idea: Treat convolution matrix as parameters and learn them! [Figure: a 7x7 binary input image is convolved with a learned 3x3 kernel of parameters θ_11 ... θ_33, producing a 5x5 convolved image of real values.] 6
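As a minimal NumPy sketch (not from the lecture) of what this layer computes: the 3x3 kernel theta below is an ordinary parameter array that would be learned by gradient descent, exactly like any other weight. The function name conv2d_valid, the random initialization, and the stand-in image are illustrative choices.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D 'valid' convolution (really cross-correlation, as in most CNN libraries)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

# 3x3 kernel of learnable parameters theta (here just initialized randomly)
theta = np.random.randn(3, 3) * 0.1
image = np.random.rand(7, 7)            # stand-in for the 7x7 binary input image
convolved = conv2d_valid(image, theta)  # 5x5 convolved image
print(convolved.shape)
```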
Downsampling by Averaging Downsampling by averaging used to be a common approach. This is a special case of convolution where the weights are fixed to a uniform distribution. The example below uses a stride of 2.
Input Image:
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0
Convolution (uniform weights):
1/4 1/4
1/4 1/4
Convolved Image:
3/4 3/4 1/4
3/4 1/4 0
1/4 0 0
7
Max-Pooling Max-pooling is another (common) form of downsampling. Instead of averaging, we take the max value within the same range as the equivalently-sized convolution. The example below uses a stride of 2.
Input Image:
1 1 1 1 1 0
1 0 0 1 0 0
1 0 1 0 0 0
1 1 0 0 0 0
1 0 0 0 0 0
0 0 0 0 0 0
Max-pooling: out = max(x_{i,j}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1})
Max-Pooled Image:
1 1 1
1 1 0
1 0 0
8
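A small NumPy sketch (illustrative, not part of the slides) covering both of the last two slides: stride-2 downsampling by averaging and stride-2 max-pooling over the same 6x6 input image. The helper name pool2d is made up.

```python
import numpy as np

def pool2d(image, size=2, stride=2, mode="max"):
    """Downsample by taking the max (or mean) over non-overlapping size x size windows."""
    H, W = image.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 1, 1, 1, 1, 0],
              [1, 0, 0, 1, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0]])

print(pool2d(x, mode="mean"))  # stride-2 averaging: [[3/4, 3/4, 1/4], [3/4, 1/4, 0], [1/4, 0, 0]]
print(pool2d(x, mode="max"))   # stride-2 max-pooling: [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
```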
Multi-Class Output [Figure: feed-forward network diagram with an Input layer, a Hidden Layer, and a multi-class Output layer.] 10
Multi-Class Output
Softmax Layer: $y_k = \frac{\exp(b_k)}{\sum_{l=1}^{K} \exp(b_l)}$
(F) Loss: $J = -\sum_{k=1}^{K} y_k^{*} \log(y_k)$
(E) Output (softmax): $y_k = \frac{\exp(b_k)}{\sum_{l=1}^{K} \exp(b_l)}$
(D) Output (linear): $b_k = \sum_{j=0}^{D} \beta_{kj} z_j, \;\forall k$
(C) Hidden (nonlinear): $z_j = \sigma(a_j), \;\forall j$
(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i, \;\forall j$
(A) Input: Given $x_i, \;\forall i$
11
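Below is a hedged NumPy sketch of one forward pass through layers (A)-(F), assuming a sigmoid hidden nonlinearity and omitting the bias terms (x_0 = z_0 = 1) for brevity; all function and variable names are illustrative, not the course's reference code.

```python
import numpy as np

def softmax(b):
    b = b - b.max()               # subtract max for numerical stability
    e = np.exp(b)
    return e / e.sum()

def forward(x, alpha, beta):
    """One hidden layer + softmax output, following steps (A)-(F) above."""
    a = alpha @ x                 # (B) hidden (linear)
    z = 1.0 / (1.0 + np.exp(-a))  # (C) hidden (nonlinear), sigmoid
    b = beta @ z                  # (D) output (linear)
    y = softmax(b)                # (E) output (softmax)
    return y

def cross_entropy(y_star, y):
    return -np.sum(y_star * np.log(y))   # (F) loss

# toy dimensions: M inputs, D hidden units, K classes (bias terms omitted here)
M, D, K = 4, 5, 3
x = np.random.rand(M)
alpha = np.random.randn(D, M)
beta = np.random.randn(K, D)
y = forward(x, alpha, beta)
y_star = np.eye(K)[0]             # one-hot true label
print(y, cross_entropy(y_star, y))
```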
Whiteboard Training a CNN SGD for CNNs Backpropagation for CNNs 12
Whiteboard Common CNN Layers ReLU Layer Background: Subgradient Fully-connected Layer (w/ tensor input) Softmax Layer Convolutional Layer Max-Pooling Layer 13
Convolutional Layer 14
Convolutional Layer 15
Max-Pooling Layer 16
Max-Pooling Layer 17
Convolutional Neural Network (CNN) Typical layers include: Convolutional layer, Max-pooling layer, Fully-connected (Linear) layer, ReLU layer (or some other nonlinear activation function), Softmax. These can be arranged into arbitrarily deep topologies. Architecture #1: LeNet-5 18
Architecture #2: AlexNet CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2012) 15.3% top-5 error on the ImageNet LSVRC-2012 contest. Input image (pixels), five convolutional layers (w/ max-pooling), three fully connected layers, 1000-way softmax. 19
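For concreteness, here is a rough PyTorch sketch of the AlexNet layer sequence just described (five convolutional layers with max-pooling, three fully connected layers, a 1000-way output). It is not the authors' code: kernel sizes, strides, and channel counts follow the 2012 paper, but details such as local response normalization, dropout, and the two-GPU split are omitted, and the use of PyTorch here is my choice.

```python
import torch.nn as nn

# Rough AlexNet-style layer sequence; expects a 3 x 227 x 227 input image.
alexnet_sketch = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),    nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),  nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096),        nn.ReLU(),
    nn.Linear(4096, 1000),        # 1000-way scores; softmax applied inside the loss
)
```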
CNNs for Image Recognition Slide from Kaiming He 20
Mini-Batch SGD Gradient Descent: Compute true gradient exactly from all N examples. Mini-Batch SGD: Approximate true gradient by the average gradient of K randomly chosen examples. Stochastic Gradient Descent (SGD): Approximate true gradient by the gradient of one randomly chosen example. 21
Mini-Batch SGD Three variants of first-order optimization: 22
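The sketch below (illustrative NumPy, not from the slides) shows that the three variants differ only in how many examples the gradient is averaged over; the function sgd_variants and the linear-regression gradient are made-up names for the example.

```python
import numpy as np

def sgd_variants(grad_fn, theta, X, y, lr=0.1, batch_size=None):
    """
    One parameter update illustrating the three variants:
      batch_size=None -> Gradient Descent (all N examples)
      batch_size=K    -> Mini-Batch SGD   (K randomly chosen examples)
      batch_size=1    -> SGD              (one randomly chosen example)
    grad_fn(theta, X, y) returns the average gradient over the given examples.
    """
    N = X.shape[0]
    idx = np.arange(N) if batch_size is None else np.random.choice(N, size=batch_size, replace=False)
    g = grad_fn(theta, X[idx], y[idx])
    return theta - lr * g

# example: average gradient of squared error for linear regression
grad_fn = lambda theta, X, y: 2 * X.T @ (X @ theta - y) / len(y)
X, y = np.random.randn(100, 3), np.random.randn(100)
theta = np.zeros(3)
theta = sgd_variants(grad_fn, theta, X, y, batch_size=10)  # one mini-batch update
```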
CNN VISUALIZATIONS 23
3D Visualization of CNN http://scs.ryerson.ca/~aharley/vis/conv/
Convolution of a Color Image Color images consist of 3 floats per pixel for RGB (red, green, blue) color values. Convolution must also be 3-dimensional. [Figure: a 32x32x3 image is convolved (slid over all spatial locations) with a 5x5x3 filter, producing a 28x28x1 activation map. Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N)] 25
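A small NumPy sketch (not from the slides) of the figure's computation: one 5x5x3 filter slid over a 32x32x3 image at stride 1 with no padding yields a 28x28 activation map; a real convolutional layer would use many such filters, one per output channel. Function and variable names are illustrative.

```python
import numpy as np

def conv3d_single_filter(image, filt):
    """Slide a 5x5x3 filter over a 32x32x3 image (stride 1, no padding) -> 28x28 map."""
    H, W, C = image.shape
    fH, fW, fC = filt.shape
    assert C == fC
    out = np.zeros((H - fH + 1, W - fW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+fH, j:j+fW, :] * filt)
    return out

image = np.random.rand(32, 32, 3)   # RGB image
filt  = np.random.randn(5, 5, 3)    # one learned filter
activation_map = conv3d_single_filter(image, filt)
print(activation_map.shape)         # (28, 28)
```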
Animation of 3D Convolution http://cs231n.github.io/convolutional-networks/ Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N) 26
MNIST Digit Recognition with CNNs (in your browser) https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html Figure from Andrej Karpathy 27
CNN Summary CNNs: Are used for all aspects of computer vision, and have won numerous pattern recognition competitions. Are able to learn interpretable features at different levels of abstraction. Typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers. Other Resources: Readings on course website; Andrej Karpathy, CS231n Notes http://cs231n.github.io/convolutional-networks/ 28
BAYESIAN NETWORKS 29
Bayes Nets Outline Motivation Structured Prediction Background Conditional Independence Chain Rule of Probability Directed Graphical Models Writing Joint Distributions Definition: Bayesian Network Qualitative Specification Quantitative Specification Familiar Models as Bayes Nets Conditional Independence in Bayes Nets Three case studies D-separation Markov blanket Learning Fully Observed Bayes Net (Partially Observed Bayes Net) Inference Sampling directly from the joint distribution Gibbs Sampling 31
MOTIVATION: STRUCTURED PREDICTION 32
Structured Prediction Most of the models we've seen so far were for classification. Given observations: x = (x_1, x_2, ..., x_K). Predict a (binary) label: y. Many real-world problems require structured prediction. Given observations: x = (x_1, x_2, ..., x_K). Predict a structure: y = (y_1, y_2, ..., y_J). Some classification problems benefit from latent structure. 33
Structured Prediction Examples Examples of structured prediction: Part-of-speech (POS) tagging, Handwriting recognition, Speech recognition, Word alignment, Congressional voting. Examples of latent structure: Object recognition. 34
Dataset for Supervised Part-of-Speech (POS) Tagging
Data: D = {x^(n), y^(n)}, n = 1, ..., N
Sample 1: y^(1) = n v p d n, x^(1) = time flies like an arrow
Sample 2: y^(2) = n n v d n, x^(2) = time flies like an arrow
Sample 3: y^(3) = n v p n n, x^(3) = flies fly with their wings
Sample 4: y^(4) = p n n v v, x^(4) = with time you will see
35
Dataset for Supervised Handwriting Recognition
Data: D = {x^(n), y^(n)}, n = 1, ..., N
Sample 1: y^(1) = u n e x p e c t e d, x^(1) = [images of the handwritten characters]
Sample 2: y^(2) = v o l c a n i c, x^(2) = [images of the handwritten characters]
Sample 3: y^(3) = e m b r a c e s, x^(3) = [images of the handwritten characters]
Fig. 5. Handwriting recognition: Example words from the dataset used. Figures from (Chatzis & Demiris, 2013) 36
Dataset for Supervised Phoneme (Speech) Recognition
Data: D = {x^(n), y^(n)}, n = 1, ..., N
Sample 1: y^(1) = h# dh ih s w uh z iy z iy, x^(1) = [speech signal]
Sample 2: y^(2) = f ao r ah s s h#, x^(2) = [speech signal]
Figures from (Jansen & Niyogi, 2013) 37
Application: Word Alignment / Phrase Extraction Variables (boolean): For each (Chinese phrase, English phrase) pair, are they linked? Interactions: Word fertilities, Few jumps (discontinuities), Syntactic reorderings, ITG constraint on alignment, Phrases are disjoint (?) (Burkett & Klein, 2012) 38
Application: Congressional Voting Variables: Text of all speeches of a representative Local contexts of references between two representatives Interactions: Words used by representative and their vote Pairs of representatives and their local context (Stoyanov & Eisner, 2012) 39
Structured Prediction Examples Examples of structured prediction: Part-of-speech (POS) tagging, Handwriting recognition, Speech recognition, Word alignment, Congressional voting. Examples of latent structure: Object recognition. 40
Case Study: Object Recognition Data consists of images x and labels y. [Figure: four example images with their labels: x^(1) pigeon y^(1), x^(2) rhinoceros y^(2), x^(3) leopard y^(3), x^(4) llama y^(4).] 41
Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define graphical model with these latent variables in mind. z is not observed at train or test time. [Example image: leopard] 42
Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: each image patch X_i is paired with a latent part label Z_i (e.g. Z2-X2, Z7-X7), all connected to the image label Y (here: leopard).] 43
Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches. Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass). Define graphical model with these latent variables in mind. z is not observed at train or test time. [Figure: as before, but with factors ψ1-ψ4 drawn connecting the patches X_i, their latent part labels Z_i, and the image label Y (here: leopard).] 44
Structured Prediction Preview of challenges to come: Consider the task of finding the most probable assignment to the output.
Classification: $\hat{y} = \operatorname{argmax}_{y} \; p(y \mid x)$ where $y \in \{+1, -1\}$
Structured Prediction: $\hat{\mathbf{y}} = \operatorname{argmax}_{\mathbf{y} \in \mathcal{Y}} \; p(\mathbf{y} \mid \mathbf{x})$ where $\mathcal{Y}$ is very large
45
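To make the challenge concrete, here is an illustrative brute-force decoder (not from the lecture): it enumerates all K^J tag sequences, which is only possible for toy sizes. The score function and names are made up; real structured predictors avoid this enumeration with dynamic programming or approximate inference.

```python
import itertools
import numpy as np

def brute_force_decode(score_fn, K, J):
    """Score every length-J sequence over K labels and return the argmax (K**J candidates)."""
    best_y, best_score = None, -np.inf
    for y in itertools.product(range(K), repeat=J):
        s = score_fn(y)
        if s > best_score:
            best_y, best_score = y, s
    return best_y

# toy score: prefer sequences whose adjacent labels are similar
score_fn = lambda y: -sum(abs(y[j] - y[j-1]) for j in range(1, len(y)))
print(brute_force_decode(score_fn, K=5, J=4))   # 5**4 = 625 candidates; 5**40 would be hopeless
```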
Machine Learning The data inspires the structures we want to predict. Our model defines a score for each structure; it also tells us what to optimize. Inference finds {best structure, marginals, partition function} for a new observation. (Inference is usually called as a subroutine in learning.) Learning tunes the parameters of the model. [Diagram: ML drawing on Domain Knowledge, Mathematical Modeling, Combinatorial Optimization, and Optimization.] 46
Machine Learning [Diagram: the same pipeline instantiated on example sentences ("Alice saw Bob on a hill with a telescope", "time flies like an arrow"): Data feeds a Model over variables X_1, ..., X_5, an Objective is specified, and Inference and Learning follow. (Inference is usually called as a subroutine in learning.)] 47
BACKGROUND 48
Whiteboard Background Chain Rule of Probability Conditional Independence 49
Background: Chain Rule of Probability
For random variables A and B:
$P(A, B) = P(A \mid B)\,P(B)$
For random variables $X_1, X_2, X_3, X_4$:
$P(X_1, X_2, X_3, X_4) = P(X_1 \mid X_2, X_3, X_4)\,P(X_2 \mid X_3, X_4)\,P(X_3 \mid X_4)\,P(X_4)$
50
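A quick numerical check of the chain rule on a random joint table (illustrative NumPy, not part of the lecture):

```python
import numpy as np

# Random joint P(A, B) over binary A (rows) and B (columns)
joint = np.random.rand(2, 2)
joint /= joint.sum()

P_B = joint.sum(axis=0)            # marginal P(B)
P_A_given_B = joint / P_B          # P(A | B), columns indexed by B

assert np.allclose(joint, P_A_given_B * P_B)   # chain rule: P(A,B) = P(A|B) P(B)
```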
Background: Conditional Independence
Random variables A and B are conditionally independent given C if:
$P(A, B \mid C) = P(A \mid C)\,P(B \mid C)$   (1)
or equivalently:
$P(A \mid B, C) = P(A \mid C)$   (2)
We write this as:
$A \perp B \mid C$   (3)
Later we will also write: $I\langle A, \{C\}, B \rangle$
51
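Similarly, one can construct a joint over (A, B, C) in which A ⊥ B | C holds by design and verify definition (1) numerically. This is an illustrative sketch; the CPT values are arbitrary.

```python
import numpy as np

P_C = np.array([0.3, 0.7])
P_A_given_C = np.array([[0.9, 0.2],    # rows: A, columns: C
                        [0.1, 0.8]])
P_B_given_C = np.array([[0.5, 0.6],    # rows: B, columns: C
                        [0.5, 0.4]])

# joint[a, b, c] = P(A=a|C=c) P(B=b|C=c) P(C=c)
joint = np.einsum('ac,bc,c->abc', P_A_given_C, P_B_given_C, P_C)

P_AB_given_C = joint / joint.sum(axis=(0, 1), keepdims=True)
for c in range(2):
    outer = np.outer(P_A_given_C[:, c], P_B_given_C[:, c])
    assert np.allclose(P_AB_given_C[:, :, c], outer)   # P(A,B|C) = P(A|C) P(B|C)
```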
Bayesian Networks DIRECTED GRAPHICAL MODELS 52
Example: Tornado Alarms 1. Imagine that you work at the 911 call center in Dallas 2. You receive six calls informing you that the Emergency Weather Sirens are going off 3. What do you conclude? Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html 53
Whiteboard Directed Graphical Models (Bayes Nets) Example: Tornado Alarms Writing Joint Distributions Idea #1: Giant Table Idea #2: Rewrite using chain rule Idea #3: Assume full independence Idea #4: Drop variables from RHS of conditionals Definition: Bayesian Network Observed Variables in Graphical Models 55
Bayesian Network
[Figure: directed graph with edges X_1 → X_2, X_2 → X_4, X_3 → X_4, X_3 → X_5.]
$p(X_1, X_2, X_3, X_4, X_5) = p(X_5 \mid X_3)\,p(X_4 \mid X_2, X_3)\,p(X_3)\,p(X_2 \mid X_1)\,p(X_1)$
56
Bayesian Network
Definition: $P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{parents}(X_i))$
A Bayesian Network is a directed graphical model. It consists of a graph G and the conditional probabilities P. These two parts fully specify the distribution:
Qualitative Specification: G
Quantitative Specification: P
57
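As an illustrative sketch (not from the lecture), the factorization for the five-variable network above can be evaluated directly from its conditional probability tables. The CPT values below are made up for binary variables, and the final check simply confirms that the product defines a valid joint distribution.

```python
import numpy as np

P_X1 = np.array([0.6, 0.4])
P_X2_given_X1 = np.array([[0.7, 0.2],                        # P(X2 | X1): rows X2, cols X1
                          [0.3, 0.8]])
P_X3 = np.array([0.5, 0.5])
P_X4_given_X2_X3 = np.random.dirichlet([1, 1], size=(2, 2))  # indexed [x2, x3, x4]
P_X5_given_X3 = np.array([[0.9, 0.1],                        # P(X5 | X3): rows X5, cols X3
                          [0.1, 0.9]])

def joint(x1, x2, x3, x4, x5):
    """P(X1..X5) = P(X1) P(X2|X1) P(X3) P(X4|X2,X3) P(X5|X3)"""
    return (P_X1[x1] * P_X2_given_X1[x2, x1] * P_X3[x3] *
            P_X4_given_X2_X3[x2, x3, x4] * P_X5_given_X3[x5, x3])

print(joint(1, 0, 1, 1, 0))
# Sanity check: the joint sums to 1 over all 2**5 assignments.
total = sum(joint(*xs) for xs in np.ndindex(2, 2, 2, 2, 2))
assert np.isclose(total, 1.0)
```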