Data Mining & Machine Learning


Data Mining & Machine Learning, CS57300, Purdue University, March 1, 2018

Recap of Last Class (Model Search): Forward and Backward Passes

Feedforward Neural Networks: Architectures

[Figure (Ack: Fei-Fei Li, Andrej Karpathy & Justin Johnson): example architectures built from fully-connected layers; a 2-layer neural net (1-hidden-layer neural net) and a 3-layer neural net (2-hidden-layer neural net).]

Layers do not need to be fully connected. The size of a layer (its number of units) is another hyper-parameter.*

* A hyper-parameter is a parameter that is fixed in advance, not learned.
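Below is a minimal sketch (not from the slides) of such a 1-hidden-layer, fully-connected network in NumPy. The layer sizes are made-up hyper-parameters, fixed before training, while the weight matrices W1 and W2 are what gets learned.

```python
# Minimal sketch of a fully-connected 1-hidden-layer network (hypothetical sizes).
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_outputs = 4, 3, 2   # layer sizes are hyper-parameters

# The extra row in each weight matrix absorbs the constant-1 (bias) unit.
W1 = rng.normal(scale=0.1, size=(n_inputs + 1, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden + 1, n_outputs))

def forward(x, W1, W2):
    """Forward pass: input -> hidden activations -> output scores."""
    x = np.append(x, 1.0)          # constant-1 unit on the input
    h1 = np.tanh(x @ W1)           # hidden layer with tanh non-linearity
    h1 = np.append(h1, 1.0)        # constant-1 unit on the hidden layer
    return h1 @ W2                 # output-layer scores
```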

How Feedforward Networks Work: https://www.youtube.com/watch?v=aircaruvnkk

Feedforward Neural Network Example (is the person rich?)

[Figure: a 1-hidden-layer network. The input attributes of example i are x_1(i) "is tanned", x_2(i) "employs a captain", x_3(i) "pays high property taxes", and x_4(i) "employs a cook". The input vector plus a constant-1 unit feeds the hidden-layer neurons through weights w^(1); the hidden activations plus a constant-1 unit feed the output-layer neurons, which produce y in {(0,1), (1,0)}, a one-hot encoding of "is rich" = {False, True}.]
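As a hypothetical usage of the forward sketch above with this slide's network (four binary attributes in, two output scores for the one-hot "is rich" label); the attribute values here are made up:

```python
# Hypothetical example i: tanned, no captain, high property taxes, no cook.
x_i = np.array([1.0, 0.0, 1.0, 0.0])
scores = forward(x_i, W1, W2)              # one score per output unit
predicted_class = int(np.argmax(scores))   # 0 = not rich, 1 = rich (assumed order)
```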

General Prediction Procedure (Forward Pass)

For every training example $i$, the forward pass computes the activations of each layer in turn, from Layer 1 up to Layer K: $\{\hat h^{(1)}(i)\}_{i=1}^N,\ \{\hat h^{(2)}(i)\}_{i=1}^N,\ \ldots,\ \{\hat h^{(K-1)}(i)\}_{i=1}^N,\ \{\hat h^{(K)}(i)\}_{i=1}^N$. The prediction for example $i$ is the activation of the last layer, $\hat y(i) = \hat h^{(K)}(i)$.
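A sketch of this general forward pass for a K-layer network, under the same NumPy conventions as the earlier sketch: each layer k has a weight matrix weights[k] and produces activations h^(k), and the last layer's activations are the prediction ŷ(i). Caching every layer's activations is what the backward pass relies on.

```python
import numpy as np

def forward_pass(x, weights, activation=np.tanh):
    """Run one example x through layers 1..K, caching each layer's activations."""
    h = np.append(x, 1.0)                      # constant-1 unit on the input
    activations = [h]
    last = len(weights) - 1
    for k, W in enumerate(weights):
        z = h @ W
        if k < last:
            h = np.append(activation(z), 1.0)  # hidden layer: non-linearity + bias unit
        else:
            h = z                              # output layer: raw scores
        activations.append(h)
    return activations                         # activations[-1] is ŷ(i) = ĥ^(K)(i)
```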

Forward + Backward Updates (following the training data)

The loss function $L$ computes the prediction error of every training example $i$, where the prediction is $\hat y(i) = \hat h^{(K)}(i)$. Each weight matrix is updated using the gradient averaged over all $N$ training examples (all of them, not a subset):

$W_k \leftarrow W_k + \epsilon\,\frac{1}{N}\sum_{i=1}^N \frac{\partial L(\hat h^{(K)}(i))}{\partial h^{(k)}}\,\frac{\partial h^{(k)}(i)}{\partial W_k}, \qquad k = K, K-1, \ldots, 2, 1.$

The factors $\partial L/\partial h^{(k)}$ are obtained by the chain rule, propagating backward from the output layer:

$\frac{\partial L(\hat h^{(K)})}{\partial h^{(K-1)}} = \frac{\partial L(\hat h^{(K)})}{\partial h^{(K)}}\,\frac{\partial \hat h^{(K)}(h^{(K-1)}(i))}{\partial h^{(K-1)}}, \quad
\frac{\partial L(\hat h^{(K)})}{\partial h^{(K-2)}} = \frac{\partial L(\hat h^{(K)})}{\partial h^{(K-1)}}\,\frac{\partial h^{(K-1)}(h^{(K-2)}(i))}{\partial h^{(K-2)}}, \quad \ldots, \quad
\frac{\partial L(\hat h^{(K)})}{\partial h^{(1)}} = \frac{\partial L(\hat h^{(K)})}{\partial h^{(2)}}\,\frac{\partial h^{(2)}(h^{(1)})}{\partial h^{(1)}}.$
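A hedged sketch of these updates for the 1-hidden-layer network from the earlier example. The slides leave the loss L unspecified; here a softmax cross-entropy loss is assumed, and since that loss is minimized the update subtracts the averaged gradient (the "+" above corresponds to ascending a score).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def backward_update(X, Y, W1, W2, eps=0.1):
    """One forward + backward update. X: (N, n_inputs); Y: (N, n_outputs) one-hot."""
    N = X.shape[0]
    Xb = np.hstack([X, np.ones((N, 1))])    # inputs + constant-1 unit
    H = np.tanh(Xb @ W1)                    # hidden activations h^(1)
    Hb = np.hstack([H, np.ones((N, 1))])
    P = softmax(Hb @ W2)                    # predictions ŷ = ĥ^(K)

    dZ2 = (P - Y) / N                       # ∂L/∂(output scores), averaged over N
    dW2 = Hb.T @ dZ2                        # gradient w.r.t. W2
    dH = (dZ2 @ W2.T)[:, :-1] * (1 - H**2)  # chain rule back through tanh
    dW1 = Xb.T @ dH                         # gradient w.r.t. W1

    W1 -= eps * dW1                         # gradient step of size eps
    W2 -= eps * dW2
    return W1, W2
```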

Stochastic Gradient Descent

Rather than using all training examples in each gradient descent step, we use just a subset of the data at a time. At every gradient update we randomly choose a new set of n training examples; in practice we sample without replacement, and once the training data is exhausted we reshuffle and restart the sampling. Each such subset is called a mini-batch. Training by gradient steps over mini-batches is called mini-batch stochastic gradient ascent (when maximizing a score) or mini-batch stochastic gradient descent (when minimizing a loss).
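A sketch of the mini-batch loop just described: shuffle the training data, take successive mini-batches of n examples without replacement, and reshuffle once the data is exhausted. It reuses the hedged backward_update sketched above.

```python
import numpy as np

def minibatch_sgd(X, Y, W1, W2, n=32, epochs=10, eps=0.1):
    """Mini-batch SGD: sample without replacement, reshuffle when exhausted."""
    N = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(N)            # sampling without replacement
        for start in range(0, N, n):
            idx = order[start:start + n]      # one mini-batch of (up to) n examples
            W1, W2 = backward_update(X[idx], Y[idx], W1, W2, eps)
    return W1, W2
```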

Rationalizing Mini-Batch Sizes

In full gradient descent the gradient is averaged over all $N$ training examples:

$W_k \leftarrow W_k + \epsilon\,\frac{1}{N}\sum_{i=1}^N \frac{\partial L(\hat h^{(K)}(i))}{\partial h^{(k)}}\,\frac{\partial h^{(k)}(i)}{\partial W_k}.$

Larger mini-batches produce more accurate estimates of this gradient. Would larger batches therefore be better for training? Does optimization benefit from more data because we can compute better gradients? And do smaller learning rates $\epsilon$, which give a more accurate approximation of gradient descent, yield better models?
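One way to see the first point empirically, as an illustrative sketch (not from the slides): compare mini-batch gradient estimates against the full-data gradient; larger batches give estimates closer to the full gradient. grad_W2 is a hypothetical helper using the same conventions (and the softmax) from the sketches above.

```python
import numpy as np

def grad_W2(X, Y, W1, W2):
    """Gradient of the assumed cross-entropy loss w.r.t. W2, averaged over X."""
    N = X.shape[0]
    Xb = np.hstack([X, np.ones((N, 1))])
    Hb = np.hstack([np.tanh(Xb @ W1), np.ones((N, 1))])
    P = softmax(Hb @ W2)
    return Hb.T @ (P - Y) / N

def gradient_error(X, Y, W1, W2, batch_size, trials=100, seed=0):
    """Average distance between a size-`batch_size` gradient and the full-data one."""
    rng = np.random.default_rng(seed)
    g_full = grad_W2(X, Y, W1, W2)
    errs = []
    for _ in range(trials):
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        errs.append(np.linalg.norm(grad_W2(X[idx], Y[idx], W1, W2) - g_full))
    return float(np.mean(errs))   # shrinks roughly like 1/sqrt(batch_size)
```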

Model Search for Deep Neural Networks

Q: Is it better to search for the best model (highest likelihood score) using all the training data?
A: It depends (Zhang et al. 2017). Deep neural network objectives are non-convex, with many local optima.
Pro of using all the training data: each step follows the exact gradient of the training objective, so the search directly targets a model that better fits the training data.
Cons: using all training examples at once often works terribly in practice; the model found by full-batch gradient descent can perform poorly even on the training data itself, because the search gets stuck in poor local optima. Small mini-batches are often better than larger batches, but not too small, and the best size depends on the learning rate ε (the size of the gradient steps).

(Zhang et al. 2017) Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. Understanding deep learning requires rethinking generalization. ICLR 2017.

Weird Learning Characteristics of Deep Neural Networks

[Figure: GoogLeNet test accuracy for different mini-batch sizes and learning rates.]

In this example, increasing the mini-batch size reduces model accuracy on the test data (the model generalizes less), but increasing the learning rate improves things (i.e., making a worse approximation of the gradient step improves test accuracy?!). We do not yet know why; there are many competing hypotheses (as of 2018).