Bayesian Networks (Part I)


10-601 Introduction to Machine Learning, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Bayesian Networks (Part I). Graphical Model Readings: Murphy 10, 10.2.1; Bishop 8.1, 8.2.2; HTF --; Mitchell 6.11. Matt Gormley, Lecture 22, April 10, 2017 1

Reminders Peer Tutoring Homework 7: Deep Learning Release: Wed, Apr. 05 Part I due Wed, Apr. 12 Start Early Part II due Mon, Apr. 17 2

CONVOLUTIONAL NEURAL NETS 3

Deep Learning Outline Background: Computer Vision Image Classification ILSVRC 2010-2016 Traditional Feature Extraction Methods Convolution as Feature Extraction Convolutional Neural Networks (CNNs) Learning Feature Abstractions Common CNN Layers: Convolutional Layer Max-Pooling Layer Fully-connected Layer (w/tensor input) Softmax Layer ReLU Layer Background: Subgradient Architecture: LeNet Architecture: AlexNet Training a CNN SGD for CNNs Backpropagation for CNNs 4

Convolutional Neural Network (CNN) Typical layers include: Convolutional layer Max-pooling layer Fully-connected (Linear) layer ReLU layer (or some other nonlinear activation function) Softmax These can be arranged into arbitrarily deep topologies Architecture #1: LeNet-5 5
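
As a rough illustration of how such layers stack, here is a minimal sketch of a LeNet-5-style network, assuming PyTorch is available; the channel sizes follow the classic LeNet-5 layout, but the activation and pooling choices are simplified relative to the original (which used tanh units and average pooling).

```python
# Minimal sketch of a LeNet-5-style CNN (assumes PyTorch).
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 28x28x6 -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5),  # 14x14x6 -> 10x10x16
    nn.ReLU(),
    nn.MaxPool2d(2),                  # 10x10x16 -> 5x5x16
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),       # fully-connected layers
    nn.ReLU(),
    nn.Linear(120, 84),
    nn.ReLU(),
    nn.Linear(84, 10),                # 10-way output; softmax is folded into the loss
)

x = torch.randn(1, 1, 32, 32)         # one fake 32x32 grayscale image
print(lenet5(x).shape)                # torch.Size([1, 10])
```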

Convolutional Layer CNN key idea: Treat convolution matrix as parameters and learn them! [Figure: a binary input image, a learned 3x3 convolution with parameters θ11, ..., θ33, and the resulting convolved image of real values.] 6
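
A minimal NumPy sketch of this idea is below; the 3x3 array theta stands in for the learned parameters θ11, ..., θ33 (and, as in most deep learning libraries, the operation is implemented as cross-correlation, i.e. the filter is not flipped). The input image is a made-up toy example.

```python
# Sketch of a single "valid" 2D convolution with a learnable 3x3 filter (NumPy only).
import numpy as np

def conv2d_valid(image, theta):
    kh, kw = theta.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * theta)
    return out

image = (np.random.rand(7, 7) > 0.5).astype(float)  # toy binary input image
theta = np.random.randn(3, 3)                        # "learned" convolution parameters
print(conv2d_valid(image, theta).shape)              # (5, 5)
```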

Downsampling by Averaging Downsampling by averaging used to be a common approach This is a special case of convolution where the weights are fixed to a uniform distribution The example below uses a stride of 2 [Figure: a 6x6 binary input image convolved with a fixed 2x2 filter of weights 1/4, applied with stride 2, yielding a 3x3 convolved image.] 7
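
A sketch of the same operation in NumPy, with the weights fixed to 1/4 and a stride of 2; the 6x6 input is made up for illustration.

```python
# Sketch of downsampling by averaging: a fixed 2x2 filter of 1/4's applied with stride 2.
import numpy as np

def avg_downsample(image, stride=2):
    k = np.full((stride, stride), 1.0 / (stride * stride))  # fixed uniform weights
    H, W = image.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i * stride:(i + 1) * stride, j * stride:(j + 1) * stride]
            out[i, j] = np.sum(patch * k)
    return out

image = (np.random.rand(6, 6) > 0.5).astype(float)
print(avg_downsample(image))   # 3x3 output; each entry is the mean of a 2x2 block
```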

Max-Pooling Max-pooling is another (common) form of downsampling Instead of averaging, we take the max value within the same range as the equivalently-sized convolution The example below uses a stride of 2 [Figure: a 6x6 binary input image max-pooled over 2x2 blocks {x_{i,j}, x_{i+1,j}, x_{i,j+1}, x_{i+1,j+1}} with stride 2, yielding a 3x3 max-pooled image.] 8
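
The corresponding max-pooling sketch only changes the reduction from a weighted sum to a max over each 2x2 block:

```python
# 2x2 max-pooling with stride 2 (NumPy only), mirroring the averaging sketch above.
import numpy as np

def max_pool(image, stride=2):
    H, W = image.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = image[i * stride:(i + 1) * stride,
                              j * stride:(j + 1) * stride].max()
    return out

image = (np.random.rand(6, 6) > 0.5).astype(float)
print(max_pool(image))   # 3x3 output of block-wise maxima
```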

Multi-Class Output [Figure: a feed-forward network with Input, Hidden Layer, and Output.] 10

Multi-Class Output Softmax Layer: y_k = \exp(b_k) / \sum_{l=1}^K \exp(b_l). The forward computation, from input to loss: (A) Input: given x_i, \forall i. (B) Hidden (linear): a_j = \sum_{i=0}^M \alpha_{ji} x_i, \forall j. (C) Hidden (nonlinear): z_j = \sigma(a_j), \forall j. (D) Output (linear): b_k = \sum_{j=0}^D \beta_{kj} z_j, \forall k. (E) Output (softmax): y_k = \exp(b_k) / \sum_{l=1}^K \exp(b_l). (F) Loss: J = -\sum_{k=1}^K y^*_k \log(y_k). 11
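
A NumPy sketch of this forward computation (A)-(F), with made-up toy values for the weights alpha and beta, the input x, and the one-hot true label y_star:

```python
# Forward pass of a one-hidden-layer network with softmax output and cross-entropy loss.
import numpy as np

M, D, K = 4, 5, 3                        # input dim, hidden dim, number of classes
rng = np.random.default_rng(0)
alpha = rng.standard_normal((D, M + 1))  # hidden-layer weights (column 0 = bias)
beta = rng.standard_normal((K, D + 1))   # output-layer weights (column 0 = bias)
x = rng.standard_normal(M)               # (A) input
y_star = np.array([0.0, 1.0, 0.0])       # true label as a one-hot vector

a = alpha @ np.concatenate(([1.0], x))   # (B) hidden (linear)
z = 1.0 / (1.0 + np.exp(-a))             # (C) hidden (nonlinear), sigmoid
b = beta @ np.concatenate(([1.0], z))    # (D) output (linear)
y = np.exp(b) / np.sum(np.exp(b))        # (E) output (softmax)
J = -np.sum(y_star * np.log(y))          # (F) cross-entropy loss
print(J)
```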

Whiteboard Training a CNN SGD for CNNs Backpropagation for CNNs 12

Whiteboard Common CNN Layers ReLU Layer Background: Subgradient Fully-connected Layer (w/tensor input) Softmax Layer Convolutional Layer Max-Pooling Layer 13

Convolutional Layer 14

Convolutional Layer 15

Max-Pooling Layer 16

Max-Pooling Layer 17

Convolutional Neural Network (CNN) Typical layers include: Convolutional layer Max-pooling layer Fully-connected (Linear) layer ReLU layer (or some other nonlinear activation function) Softmax These can be arranged into arbitrarily deep topologies Architecture #1: LeNet-5 18

Architecture #2: AlexNet CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2012) 15.3% error on ImageNet LSVRC-2012 contest Input image (pixels) Five convolutional layers (w/max-pooling) Three fully connected layers 1000-way softmax 19

CNNs for Image Recognition Slide from Kaiming He 20

Mini-Batch SGD Gradient Descent: Compute true gradient exactly from all N examples Mini-Batch SGD: Approximate true gradient by the average gradient of K randomly chosen examples Stochastic Gradient Descent (SGD): Approximate true gradient by the gradient of one randomly chosen example 21
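
A sketch of how the three gradient estimates differ, assuming a hypothetical per-example gradient function grad(theta, x_n, y_n) (e.g. the one computed by backpropagation); only the set of examples averaged over changes.

```python
# Three estimates of the gradient of the training objective (NumPy sketch).
import numpy as np

def full_gradient(theta, X, Y, grad):            # Gradient Descent: all N examples
    return np.mean([grad(theta, x, y) for x, y in zip(X, Y)], axis=0)

def minibatch_gradient(theta, X, Y, grad, K=32): # Mini-Batch SGD: K random examples
    idx = np.random.choice(len(X), size=K, replace=False)
    return np.mean([grad(theta, X[i], Y[i]) for i in idx], axis=0)

def sgd_gradient(theta, X, Y, grad):             # SGD: one random example
    i = np.random.randint(len(X))
    return grad(theta, X[i], Y[i])

# Whichever estimate g is used, the update has the same form: theta = theta - lr * g
```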

Mini-Batch SGD Three variants of first-order optimization: 22

CNN VISUALIZATIONS 23

3D Visualization of CNN http://scs.ryerson.ca/~aharley/vis/conv/

Convolution of a Color Image Color images consist of 3 floats per pixel for RGB (red, green, blue) color values Convolution must also be 3-dimensional [Figure: a 32x32x3 image is convolved with a 5x5x3 filter, slid over all spatial locations, producing a 28x28x1 activation map. Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N).] 25
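
A NumPy sketch that just verifies the shapes involved: one 5x5x3 filter slid over a 32x32x3 image, summing over all three channels, gives a 28x28 activation map.

```python
# Shape check for 3D convolution of a color image (NumPy only, random toy data).
import numpy as np

image = np.random.rand(32, 32, 3)     # H x W x channels
filt = np.random.randn(5, 5, 3)       # one filter spans all input channels

H, W, _ = image.shape
kh, kw, _ = filt.shape
activation = np.zeros((H - kh + 1, W - kw + 1))
for i in range(activation.shape[0]):
    for j in range(activation.shape[1]):
        activation[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * filt)

print(activation.shape)               # (28, 28)
```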

Animation of 3D Convolution http://cs231n.github.io/convolutional-networks/ Figure from Fei-Fei Li & Andrej Karpathy & Justin Johnson (CS231N) 26

MNIST Digit Recognition with CNNs (in your browser) https://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html Figure from Andrej Karpathy 27

CNN Summary CNNs: Are used for all aspects of computer vision, and have won numerous pattern recognition competitions Are able to learn interpretable features at different levels of abstraction Typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers Other Resources: Readings on course website Andrej Karpathy, CS231n Notes http://cs231n.github.io/convolutional-networks/ 28

BAYESIAN NETWORKS 29

Bayes Nets Outline Motivation Structured Prediction Background Conditional Independence Chain Rule of Probability Directed Graphical Models Writing Joint Distributions Definition: Bayesian Network Qualitative Specification Quantitative Specification Familiar Models as Bayes Nets Conditional Independence in Bayes Nets Three case studies D-separation Markov blanket Learning Fully Observed Bayes Net (Partially Observed Bayes Net) Inference Sampling directly from the joint distribution Gibbs Sampling 31

MOTIVATION: STRUCTURED PREDICTION 32

Structured Prediction Most of the models we've seen so far were for classification Given observations: x = (x_1, x_2, ..., x_K) Predict a (binary) label: y Many real-world problems require structured prediction Given observations: x = (x_1, x_2, ..., x_K) Predict a structure: y = (y_1, y_2, ..., y_J) Some classification problems benefit from latent structure 33

Structured Prediction Examples Examples of structured prediction Part-of-speech (POS) tagging Handwriting recognition Speech recognition Word alignment Congressional voting Examples of latent structure Object recognition 34

Dataset for Supervised Part-of-Speech (POS) Tagging Data: D = {x^{(n)}, y^{(n)}}_{n=1}^N Sample 1: x^{(1)} = "time flies like an arrow", y^{(1)} = n v p d n. Sample 2: x^{(2)} = "time flies like an arrow", y^{(2)} = n n v d n. Sample 3: x^{(3)} = "flies fly with their wings", y^{(3)} = n v p n n. Sample 4: x^{(4)} = "with time you will see", y^{(4)} = p n n v v. 35

Dataset for Supervised Handwriting Recognition Data: D = {x^{(n)}, y^{(n)}}_{n=1}^N Each x^{(n)} is a sequence of handwritten character images and y^{(n)} is the corresponding character sequence, e.g. Sample 1: "unexpected", Sample 2: "volcanic", Sample 3: "embraces". [Figure: handwriting recognition, example words from the dataset used. Figures from (Chatzis & Demiris, 2013).] 36

Dataset for Supervised Phoneme (Speech) Recognition Data: D = {x^{(n)}, y^{(n)}}_{n=1}^N Each x^{(n)} is a speech signal and y^{(n)} is its phoneme sequence, e.g. Sample 1: y^{(1)} = "h# dh ih s w uh z iy z iy", Sample 2: y^{(2)} = "f ao r ah s s h#". Figures from (Jansen & Niyogi, 2013) 37

Application: Word Alignment / Phrase Extraction Variables (boolean): For each (Chinese phrase, English phrase) pair, are they linked? Interactions: Word fertilities Few jumps (discontinuities) Syntactic reorderings ITG constraint on alignment Phrases are disjoint (?) (Burkett & Klein, 2012) 38

Application: Congressional Voting Variables: Text of all speeches of a representative Local contexts of references between two representatives Interactions: Words used by representative and their vote Pairs of representatives and their local context (Stoyanov & Eisner, 2012) 39

Structured Prediction Examples Examples of structured prediction Part-of-speech (POS) tagging Handwriting recognition Speech recognition Word alignment Congressional voting Examples of latent structure Object recognition 40

Case Study: Object Recognition Data consists of images x and labels y. [Figure: four example images x^{(1)}, ..., x^{(4)} with labels y^{(1)}, ..., y^{(4)}: pigeon, rhinoceros, leopard, llama.] 41

Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) Define graphical model with these latent variables in mind z is not observed at train or test time [Figure: a leopard image.] 42

Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) Define graphical model with these latent variables in mind z is not observed at train or test time [Figure: the leopard image with patch variables X_i, latent part labels Z_i, and image label Y.] 43

Case Study: Object Recognition Data consists of images x and labels y. Preprocess data into patches Posit a latent labeling z describing the object's parts (e.g. head, leg, tail, torso, grass) Define graphical model with these latent variables in mind z is not observed at train or test time [Figure: the same model with factors ψ_1, ..., ψ_4 connecting the patch variables X_i, latent part labels Z_i, and image label Y.] 44

Structured Prediction Preview of challenges to come Consider the task of finding the most probable assignment to the output Classification: \hat{y} = \operatorname{argmax}_y p(y | x) where y \in \{+1, -1\} Structured Prediction: \hat{y} = \operatorname{argmax}_y p(y | x) where y \in \mathcal{Y} and \mathcal{Y} is very large 45
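
To see why the size of the output space Y matters, here is a sketch of brute-force structured prediction on a toy POS-tagging instance; the scoring function is a made-up stand-in for p(y | x).

```python
# Brute-force argmax over tag sequences: feasible only for tiny tag sets and sentences.
import itertools
import numpy as np

tags = ["n", "v", "p", "d"]
sentence = ["time", "flies", "like", "an", "arrow"]

def score(y, x):                       # hypothetical stand-in for p(y | x)
    return np.random.rand()

candidates = list(itertools.product(tags, repeat=len(sentence)))
print(len(candidates))                 # 4^5 = 1024; with 45 tags and 20 words, ~45^20
y_hat = max(candidates, key=lambda y: score(y, sentence))
```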

Machine Learning The data inspires the structures we want to predict Inference finds {best structure, marginals, partition function} for a new observation (Inference is usually called as a subroutine in learning) Our model defines a score for each structure It also tells us what to optimize Learning tunes the parameters of the model [Figure: ML sits at the intersection of Domain Knowledge, Combinatorial Optimization, Mathematical Modeling, and Optimization.] 46

Machine Learning [Figure: the Data, Model, Objective, Inference, and Learning components, illustrated with the sentences "Alice saw Bob on a hill with a telescope" and "time flies like an arrow" and variables X_1, ..., X_5.] (Inference is usually called as a subroutine in learning) 47

BACKGROUND 48

Whiteboard Background Chain Rule of Probability Conditional Independence 49

Background: Chain Rule of Probability For random variables A and B: P(A, B) = P(A | B) P(B) For random variables X_1, X_2, X_3, X_4: P(X_1, X_2, X_3, X_4) = P(X_1 | X_2, X_3, X_4) P(X_2 | X_3, X_4) P(X_3 | X_4) P(X_4) 50
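
A quick numerical sanity check of the chain rule on a random joint table over two binary variables (NumPy):

```python
# Verify P(A, B) = P(A | B) P(B) on a random 2x2 joint distribution.
import numpy as np

joint = np.random.rand(2, 2)
joint /= joint.sum()                  # P(A, B), rows indexed by A, columns by B

P_B = joint.sum(axis=0)               # P(B), marginalizing out A
P_A_given_B = joint / P_B             # P(A | B) = P(A, B) / P(B)

assert np.allclose(joint, P_A_given_B * P_B)   # chain rule holds exactly
```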

Background: Conditional Independence Random variables A and B are conditionally independent given C if: P(A, B | C) = P(A | C) P(B | C) (1) or equivalently: P(A | B, C) = P(A | C) (2) We write this as: A \perp B | C (3) Later we will also write: I<A, {C}, B> 51
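
And a similar numerical check of definition (1), building a joint over binary A, B, C from made-up factors P(C), P(A | C), P(B | C) so that A and B are conditionally independent given C:

```python
# Check P(A, B | C) = P(A | C) P(B | C) on a constructed joint (NumPy only).
import numpy as np

P_C = np.array([0.4, 0.6])                       # P(C)
P_A_given_C = np.array([[0.2, 0.8],              # rows: value of C, cols: value of A
                        [0.7, 0.3]])
P_B_given_C = np.array([[0.5, 0.5],              # rows: value of C, cols: value of B
                        [0.1, 0.9]])

# Joint P(A, B, C) = P(C) P(A | C) P(B | C), indexed [a, b, c].
joint = np.einsum('c,ca,cb->abc', P_C, P_A_given_C, P_B_given_C)

P_AB_given_C = joint / joint.sum(axis=(0, 1), keepdims=True)   # P(A, B | C)
product = np.einsum('ca,cb->abc', P_A_given_C, P_B_given_C)    # P(A | C) P(B | C)
assert np.allclose(P_AB_given_C, product)
```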

Bayesian Networks DIRECTED GRAPHICAL MODELS 52

Example: Tornado Alarms 1. Imagine that you work at the 911 call center in Dallas 2. You receive six calls informing you that the Emergency Weather Sirens are going off 3. What do you conclude? Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html 53

Example: Tornado Alarms 1. Imagine that you work at the 911 call center in Dallas 2. You receive six calls informing you that the Emergency Weather Sirens are going off 3. What do you conclude? Figure from https://www.nytimes.com/2017/04/08/us/dallas-emergency-sirens-hacking.html 54

Whiteboard Directed Graphical Models (Bayes Nets) Example: Tornado Alarms Writing Joint Distributions Idea #1: Giant Table Idea #2: Rewrite using chain rule Idea #3: Assume full independence Idea #4: Drop variables from RHS of conditionals Definition: Bayesian Network Observed Variables in Graphical Models 55

Bayesian Network [Figure: a directed graph over X_1, ..., X_5.] P(X_1, X_2, X_3, X_4, X_5) = P(X_5 | X_3) P(X_4 | X_2, X_3) P(X_3) P(X_2 | X_1) P(X_1) 56
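
A sketch of this factorization for binary-valued variables, with made-up conditional probability tables (CPTs); summing the factorized product over all 2^5 assignments recovers 1, as a sanity check.

```python
# Joint distribution of a 5-variable Bayes net from its CPTs (toy values, NumPy not required).
import itertools

p_x1_is1 = 0.6                                         # P(X1 = 1)
p_x2_is1_given_x1 = {0: 0.3, 1: 0.8}                   # P(X2 = 1 | X1)
p_x3_is1 = 0.5                                         # P(X3 = 1)
p_x4_is1_given_x2x3 = {(0, 0): 0.1, (0, 1): 0.4,       # P(X4 = 1 | X2, X3)
                       (1, 0): 0.7, (1, 1): 0.9}
p_x5_is1_given_x3 = {0: 0.2, 1: 0.6}                   # P(X5 = 1 | X3)

def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4, x5):
    return (bern(p_x5_is1_given_x3[x3], x5)
            * bern(p_x4_is1_given_x2x3[(x2, x3)], x4)
            * bern(p_x3_is1, x3)
            * bern(p_x2_is1_given_x1[x1], x2)
            * bern(p_x1_is1, x1))

total = sum(joint(*a) for a in itertools.product([0, 1], repeat=5))
print(total)   # 1.0 (up to floating point)
```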

Bayesian Network Definition: P(X_1, ..., X_n) = \prod_{i=1}^{n} P(X_i | parents(X_i)) [Figure: a directed graph over X_1, ..., X_5.] A Bayesian Network is a directed graphical model It consists of a graph G and the conditional probabilities P These two parts fully specify the distribution: Qualitative Specification: G Quantitative Specification: P 57