Neural Networks. David Rosenberg, New York University. DS-GA 1003, July 26, 2017.

Neural Networks Overview: Objectives. What are neural networks? How do they fit into our toolbox? When should we consider using them?

Neural Networks Overview: Linear Prediction Functions. Linear prediction functions (SVM, ridge regression, Lasso): generate the feature vector φ(x) by hand, learn the weight vector w from data, so score = w^T φ(x). From Percy Liang's "Lecture 3" slides from Stanford's CS221, Autumn 2014.

Neural Networks Overview: Basic Neural Network (Multilayer Perceptron). Add an extra layer with a nonlinear transformation: we've introduced hidden nodes h_1 and h_2, with h_i = σ(v_i^T φ(x)), where σ is a nonlinear activation function. (We'll come back to this.) From Percy Liang's "Lecture 3" slides from Stanford's CS221, Autumn 2014.

Neural Networks Overview: Basic Neural Network. The score is just score = w_1 h_1 + w_2 h_2 = w_1 σ(v_1^T φ(x)) + w_2 σ(v_2^T φ(x)). This is the basic recipe: we can add more hidden nodes, and we can add more hidden layers. A minimal sketch of this forward pass follows. From Percy Liang's "Lecture 3" slides from Stanford's CS221, Autumn 2014.
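A minimal numpy sketch of this forward pass. The feature map phi, the sigmoid choice of σ, and the toy weights are illustrative assumptions rather than anything specified on the slide.

    import numpy as np

    def sigma(z):
        # Logistic sigmoid, one possible choice of nonlinear activation.
        return 1.0 / (1.0 + np.exp(-z))

    def score(x, phi, v, w):
        # Two-hidden-node network: score = w_1*sigma(v_1^T phi(x)) + w_2*sigma(v_2^T phi(x)).
        # phi: callable producing the hand-built feature vector phi(x)
        # v:   array of shape (2, d) whose rows are v_1 and v_2
        # w:   array of shape (2,) holding w_1 and w_2
        h = sigma(v @ phi(x))   # hidden-node values h_1, h_2
        return w @ h            # linear combination of the hidden nodes

    # Toy usage with an identity feature map (purely illustrative).
    phi = lambda x: np.asarray(x, dtype=float)
    v = np.array([[1.0, -1.0], [0.5, 2.0]])
    w = np.array([0.3, -0.7])
    print(score([1.0, 2.0], phi, v, w))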

Neural Networks Overview: Activation Functions. The nonlinearity of the activation function is a key ingredient. The logistic sigmoid function is one of the more commonly used: σ(x) = 1 / (1 + e^{-x}).

Neural Networks Overview: Activation Functions. More recently, the rectified linear function (ReLU) has been very popular: σ(x) = max(0, x). The ReLU and its derivative are much faster to compute.
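Both activations, together with the derivatives the optimizer needs, take only a few lines of numpy; this is just a sketch of the definitions above.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))       # sigma(x) = 1 / (1 + e^{-x})

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)                  # needs an exp plus a product

    def relu(x):
        return np.maximum(0.0, x)             # sigma(x) = max(0, x)

    def relu_grad(x):
        return (x > 0).astype(float)          # derivative is just a threshold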

Neural Networks Overview: Approximation Ability, f(x) = x^2. Three hidden units; logistic activation functions. Blue dots are training points; dashed lines are hidden-unit outputs; final output in red. From Bishop's Pattern Recognition and Machine Learning, Fig. 5.3.

Neural Networks Overview: Approximation Ability, f(x) = sin(x). Three hidden units; logistic activation functions. Blue dots are training points; dashed lines are hidden-unit outputs; final output in red. From Bishop's Pattern Recognition and Machine Learning, Fig. 5.3.

Neural Networks Overview: Approximation Ability, f(x) = |x|. Three hidden units; logistic activation functions. Blue dots are training points; dashed lines are hidden-unit outputs; final output in red. From Bishop's Pattern Recognition and Machine Learning, Fig. 5.3.

Neural Networks Overview: Approximation Ability, f(x) = 1(x > 0). Three hidden units; logistic activation functions. Blue dots are training points; dashed lines are hidden-unit outputs; final output in red. From Bishop's Pattern Recognition and Machine Learning, Fig. 5.3.

Neural Networks Overview: Hidden Nodes as Learned Features. We can interpret h_1 and h_2 as nonlinear features learned from data. From Percy Liang's "Lecture 3" slides from Stanford's CS221, Autumn 2014.

Neural Networks Overview: Facial Recognition, Learned Features. From Andrew Ng's CS229 Deep Learning slides (http://cs229.stanford.edu/materials/cs229-deeplearning.pdf).

Neural Networks Overview: The Hypothesis Space. What hyperparameters describe a neural network? The number of layers, the number of nodes in each hidden layer, and the activation function (many to choose from). Example neural network hypothesis space: F = { f : R^d -> R | f is a neural network with 2 hidden layers, 500 nodes in each }. Functions in F are parameterized by the weights between nodes; one such function is sketched below.
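A sketch of one member of this hypothesis space with explicit weight matrices. The input dimension, the random initialization, and the ReLU choice are assumptions for illustration; the slide leaves them open.

    import numpy as np

    d = 20                                    # input dimension (illustrative)
    rng = np.random.default_rng(0)

    # Weights of a network mapping R^d -> R with 2 hidden layers of 500 nodes each.
    W1, b1 = 0.05 * rng.normal(size=(500, d)), np.zeros(500)
    W2, b2 = 0.05 * rng.normal(size=(500, 500)), np.zeros(500)
    w3, b3 = 0.05 * rng.normal(size=500), 0.0

    def f(x, act=lambda z: np.maximum(0.0, z)):
        # One function in F; the parameters are exactly the entries of W1, b1, W2, b2, w3, b3.
        h1 = act(W1 @ x + b1)
        h2 = act(W2 @ h1 + b2)
        return w3 @ h2 + b3

    print(f(rng.normal(size=d)))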

Neural Networks Overview: Loss Functions and Learning. Neural networks give a new hypothesis space, but we can use all the same loss functions we've used before. The optimization method of choice is mini-batch SGD; in practice there are lots of little tweaks, see e.g. AdaGrad and Adam.

Neural Networks Overview: Objective Function. In our simple network, the output score is given by f(x) = w_1 σ(v_1^T φ(x)) + w_2 σ(v_2^T φ(x)). The objective with square loss is then J(w, v) = Σ_{i=1}^n (y_i - f_{w,v}(x_i))^2. Note: J(w, v) is not convex, which makes optimization much more difficult and accounts for many of the tricks of the trade.
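The objective, transcribed directly into numpy for the two-hidden-node network; the feature map phi and the sigmoid choice of σ are again illustrative assumptions.

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    def f(x, w, v, phi):
        # f(x) = w_1 sigma(v_1^T phi(x)) + w_2 sigma(v_2^T phi(x))
        return w @ sigma(v @ phi(x))

    def J(w, v, X, y, phi):
        # J(w, v) = sum_i (y_i - f_{w,v}(x_i))^2  -- non-convex in (w, v)
        return sum((y_i - f(x_i, w, v, phi)) ** 2 for x_i, y_i in zip(X, y))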

Neural Networks Overview: Learning with Back-Propagation. Back-propagation is an algorithm for computing the gradient. With lots of chain rule, you could also work out the gradient by hand; back-propagation is a clean way to organize that computation, and an efficient way to compute the gradient. A nice introduction to this perspective: Stanford CS221 Lecture 3 (2016), Slides 75-94.
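For the small network above, the chain rule can still be carried out by hand; back-propagation organizes the same computation for arbitrarily deep networks. A sketch of the hand-worked gradient for one example under square loss, assuming a sigmoid activation (so σ' = σ(1 - σ)):

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    def grads_single_example(x, y, w, v, phi):
        # Gradient of (y - f(x))^2 with respect to w and v for the two-hidden-node net.
        px = np.asarray(phi(x), dtype=float)   # feature vector phi(x), shape (d,)
        z = v @ px                             # pre-activations, shape (2,)
        h = sigma(z)                           # hidden values h_1, h_2
        out = w @ h                            # network output f(x)
        dL_dout = -2.0 * (y - out)             # square-loss derivative
        grad_w = dL_dout * h                   # dL/dw_k = dL/df * h_k
        grad_v = (dL_dout * w * h * (1.0 - h))[:, None] * px[None, :]
        return grad_w, grad_v                  # grad_v has the same (2, d) shape as v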

Neural Network Regularization. Neural networks are very expressive: they correspond to big hypothesis spaces. Many approaches are used for regularization.

Neural Network Regularization: Tikhonov Regularization? Sure. We can add l2 and/or l1 regularization terms to our objective: J(w, v) = Σ_{i=1}^n (y_i - f_{w,v}(x_i))^2 + λ_1 ||w||^2 + λ_2 ||v||^2. In the neural network literature, this is often called weight decay.
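With SGD, the gradient of the l2 penalty simply shrinks ("decays") the weights a little at every update, which is where the name comes from. A minimal sketch of one such step; the learning rate and penalty strength are placeholders.

    import numpy as np

    def weight_decay_step(w, grad_w, lr, lam):
        # One SGD step on (data loss) + lam * ||w||^2: the penalty contributes
        # gradient 2 * lam * w, pulling the weights toward zero.
        return w - lr * (grad_w + 2.0 * lam * w)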

Neural Network Regularization: Early Stopping. As we train, check performance on a validation set every once in a while. Don't stop immediately after the validation error goes back up. The patience parameter is the number of training rounds to continue after finding a minimum of the validation error. Start with patience = 10000. Whenever we find a minimum at step T, set patience <- patience + cT for some constant c, then run at least patience extra steps before stopping. See http://arxiv.org/pdf/1206.5533v2.pdf for details.
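A sketch of one reading of the patience rule above; train_step and validation_error are hypothetical placeholders for your training update and validation metric.

    def train_with_patience(train_step, validation_error,
                            eval_every=100, patience=10000, c=2.0):
        # Early stopping: keep training until we have gone `patience` steps
        # beyond the best validation error seen so far.
        best_err, best_step, step = float("inf"), 0, 0
        budget = patience
        while step < budget:
            train_step()
            step += 1
            if step % eval_every == 0:
                err = validation_error()
                if err < best_err:              # new minimum at step T
                    best_err, best_step = err, step
                    patience += c * step        # patience <- patience + c*T
                    budget = step + patience    # run at least `patience` extra steps
        return best_step, best_err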

Neural Network Regularization: Max-Norm Regularization. Constrain the norm of the incoming weight vector at every hidden node: ||w||_2 <= c. Project any w that's too large onto the ball of radius c. It's like l2-complexity control, but applied locally at each node. Why? There are heuristic justifications, but the proof is in the performance, as we'll see below. See http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf for details.
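The projection step after each gradient update, assuming the incoming weights of the hidden nodes are stored as the rows of a matrix W:

    import numpy as np

    def project_max_norm(W, c):
        # Scale down any row (the incoming weight vector of one hidden node)
        # whose l2 norm exceeds c; rows already inside the ball are untouched.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))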

Neural Network Regularization: Dropout. A recent trick for improving generalization performance is dropout. A fixed probability p is chosen; before every stochastic gradient step, each node is independently retained with probability p (dropped with probability 1 - p). A dropped node is removed, along with its links; after the stochastic gradient step, all nodes are restored. Figure from http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf.

Neural Network Regularization: Dropout at Prediction Time. At prediction time, all nodes are present and outgoing weights are multiplied by p. The retention probability p is set using a validation set, or just set to 0.5; values closer to 0.8 usually work better for input units. Figure from http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf.
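A sketch of dropout applied to one hidden layer, with p the retention probability as above; the ReLU layer is an illustrative choice.

    import numpy as np

    rng = np.random.default_rng(0)

    def hidden_layer(x, W, b, p, train=True):
        h = np.maximum(0.0, W @ x + b)         # hidden-layer activations
        if train:
            mask = rng.random(h.shape) < p     # keep each node with probability p
            return h * mask                    # dropped nodes (and their links) contribute nothing
        return p * h                           # prediction time: all nodes present, scaled by p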

Neural Network Regularization: Dropout, Why Might This Help? Since any node may randomly disappear, the network is forced to spread its knowledge across the nodes. Each hidden node only gets a randomly chosen sample of its inputs, so it won't become too reliant on any single input. The result is a more robust network.

Neural Network Regularization: Dropout, Does It Help? Figure from http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf.

Neural Network Regularization: How Big a Network? How many hidden units? With proper regularization, too many doesn't hurt, except in computation time.

Multiple Output Networks. It is very easy to add extra outputs to the neural network structure. From Andrew Ng's CS229 Deep Learning slides (http://cs229.stanford.edu/materials/cs229-deeplearning.pdf).

Multiple Output Networks: Multitask Learning. Suppose X = {natural images} and we have two tasks: does the image have a cat? Does the image have a dog? We can have one output for each task. It seems plausible that basic pixel features would be shared by the tasks, so learning them on the same neural network can benefit both tasks.

Multiple Output Networks: Single Task with Extra Tasks. Suppose there is only one task we're interested in. We can gather data from related tasks and train them along with the task we care about. No related tasks? Another trick: choose any input feature, set its value to zero, and make the prediction problem be to predict the value of that feature. This can help make the model more robust (not depending too heavily on any single input).

Multiple Output Networks: Multiclass Classification. We could make each class a separate task / output. Suppose we have K classes. Use a one-hot encoding of each y_i ∈ {1, ..., K}: y_i = (y_{i1}, ..., y_{iK}) with y_{ik} = 1(y_i = k). We have K output scores f_1(x), ..., f_K(x); each f_k is trained to predict 1 if the class is k, and 0 otherwise. Predict with f(x) = argmax_k f_k(x). In the old days, each output was trained separately, e.g. with square loss.
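One-hot encoding and the arg-max prediction rule in numpy; here classes are indexed 0, ..., K-1 rather than 1, ..., K, and scores stands for the vector of outputs f_1(x), ..., f_K(x).

    import numpy as np

    def one_hot(y, K):
        # Encode labels y_i in {0, ..., K-1} as rows with a single 1.
        Y = np.zeros((len(y), K))
        Y[np.arange(len(y)), y] = 1.0
        return Y

    def predict(scores):
        # f(x) = argmax_k f_k(x)
        return int(np.argmax(scores))

    print(one_hot(np.array([0, 2, 1]), K=3), predict(np.array([0.1, 2.0, -0.3])))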

Multiple Output Networks: Multiclass Classification with Cross-Entropy Loss. The network can do better if it knows that the classes are mutually exclusive, so we introduce a joint loss across the outputs. The joint loss function (cross-entropy/deviance) is l(w, v) = - Σ_{i=1}^n Σ_{k=1}^K y_{ik} log f_k(x_i), where y_{ik} = 1(y_i = k). This is the negative log-likelihood we get for softmax predictions in multinomial logistic regression.
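The cross-entropy loss written directly from the formula, with a softmax turning raw output scores into the probabilities f_k(x_i); the max-subtraction and the small eps are standard numerical safeguards, not part of the slide.

    import numpy as np

    def softmax(scores):
        z = scores - scores.max(axis=1, keepdims=True)   # stabilize the exponentials
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def cross_entropy(scores, Y_onehot, eps=1e-12):
        # l(w, v) = - sum_i sum_k y_ik log f_k(x_i)
        F = softmax(scores)
        return -(Y_onehot * np.log(F + eps)).sum()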

Neural Networks for Features: OverFeat. OverFeat is a neural network for image classification, trained on the huge ImageNet dataset with lots of computing resources. All those hidden layers of the network are very valuable features. The paper "CNN Features off-the-shelf: an Astounding Baseline for Recognition" showed that using features from OverFeat makes it easy to achieve state-of-the-art performance on new vision tasks. OverFeat code is at https://github.com/sermanet/overfeat.

Neural Networks: When and Why? Neural Networks Benefit from Big Data. From Andrew Ng's CS229 Deep Learning slides (http://cs229.stanford.edu/materials/cs229-deeplearning.pdf).

Neural Networks: When and Why? Big Data Requires Big Resources. The best results always involve GPU processing, typically on huge networks. From Andrew Ng's CS229 Deep Learning slides (http://cs229.stanford.edu/materials/cs229-deeplearning.pdf).

Neural Networks: When and Why? When to Use? For computer vision problems, all state-of-the-art methods use neural networks. For speech recognition, all state-of-the-art methods use neural networks. For natural language problems, maybe: neural networks are state-of-the-art, but not by as large a margin. Check out word2vec (https://code.google.com/p/word2vec/), which represents words using real-valued vectors, potentially much better than bag of words.