CS 188: Artificial Intelligence. Deep Learning I (Non-Linearity, Non-Linear Separators)

Transcription:

CS 188: Artificial Intelligence, Deep Learning I. Instructors: Pieter Abbeel & Anca Dragan, University of California, Berkeley. [These slides were created by Dan Klein, Pieter Abbeel, and Anca Dragan for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Non-Linear Separators. Data that is linearly separable works out great for linear decision rules, but what are we going to do if the dataset is just too hard? General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable, Φ: x → φ(x). For example, map 1-D data to a higher-dimensional space by adding the squared feature x². [Figures: 1-D data on the x axis; the same data plotted against x².] A small numeric sketch of this idea follows. This and next slide adapted from Ray Mooney, UT.
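The x → x² picture can be made concrete with a tiny example of my own (the data values below are made up for illustration): 1-D data that no single threshold on x can separate becomes separable by a linear rule once the squared feature is added.

import numpy as np

# 1-D points labeled +1 if they lie in [-1, 1], else -1: no single
# threshold on x separates the two classes in 1-D.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.where(np.abs(x) <= 1.0, 1, -1)

# Map each point to a higher-dimensional feature space: phi(x) = (x, x^2).
phi = np.stack([x, x ** 2], axis=1)

# In (x, x^2) space the rule "x^2 <= 1" is a linear decision rule:
# weight vector w = (0, -1) and bias b = 1.
w, b = np.array([0.0, -1.0]), 1.0
pred = np.where(phi @ w + b >= 0, 1, -1)
print((pred == y).all())   # True: the mapped data is linearly separable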

Feature Selection. To choose between two feature sets: for feature set 1, train a perceptron on the training data to get Classifier 1; for feature set 2, train a perceptron on the training data to get Classifier 2; evaluate the performance of Classifier 1 and Classifier 2 on hold-out data; select the one performing best on the hold-out data. A sketch of this procedure appears below. [Image slides: Computer Vision, Object Detection, Manual Feature Design.]
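A sketch of the hold-out procedure just described, assuming scikit-learn's Perceptron and hypothetical feature matrices (X_train_1/X_hold_1 and X_train_2/X_hold_2) extracted from the same training and hold-out examples:

from sklearn.linear_model import Perceptron

def pick_feature_set(X_train_1, X_hold_1, X_train_2, X_hold_2, y_train, y_hold):
    # For feature set 1: train a perceptron on the training data -> Classifier 1
    clf1 = Perceptron().fit(X_train_1, y_train)
    # For feature set 2: train a perceptron on the training data -> Classifier 2
    clf2 = Perceptron().fit(X_train_2, y_train)
    # Evaluate both on the hold-out data and keep the better one.
    acc1, acc2 = clf1.score(X_hold_1, y_hold), clf2.score(X_hold_2, y_hold)
    return (1, clf1) if acc1 >= acc2 else (2, clf2)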

Features and Generalization. [Images: raw image pixels vs. HoG features, Dalal and Triggs, 2005.] Manual Feature Design vs. Deep Learning. Manual feature design requires domain-specific expertise and domain-specific effort. What if we could learn the features, too? That is the idea behind deep learning. [Figure: a single perceptron with weights w1, w2, w3.]

Two-Layer Perceptron Network and N-Layer Perceptron Network. [Figure: three hidden perceptron units with first-layer weights w11 .. w33 feeding an output unit with weights w1, w2, w3; the N-layer version stacks more such layers. A code sketch of this architecture follows.] Performance. [Graphs: speech recognition and AlexNet performance; graph credit Matt Zeiler, Clarifai.]
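A minimal numpy sketch (mine, not the slide's notation) of the network in the figure: three hidden perceptron units feed one output perceptron, and the N-layer version just stacks more layers of the same kind.

import numpy as np

def two_layer_perceptron(f, W1, w2):
    # f:  feature vector, shape (3,)
    # W1: first-layer weights, shape (3, 3); row i holds w_i1, w_i2, w_i3
    # w2: output-layer weights w_1, w_2, w_3, shape (3,)
    hidden = np.sign(W1 @ f)      # each hidden unit is itself a perceptron
    return np.sign(w2 @ hidden)   # the output unit thresholds their combination

def n_layer_perceptron(f, layers):
    # layers: list of weight matrices; apply one thresholded layer after another
    h = f
    for W in layers:
        h = np.sign(W @ h)
    return h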

N-Layer Perceptron Network. Local Search. Simple, general idea: start wherever; repeat: move to the best neighboring state; if no neighbors are better than the current one, quit. Here, neighbors = small perturbations of w. Properties: plateaus and local optima. How to escape plateaus and find a good local optimum? How to deal with very large parameter vectors?

E.g., the Perceptron and the Soft-Max. The perceptron computes a score for y = +1 and a score for y = -1 [formulas are images] and predicts the label with the higher score; the soft-max turns those scores into a probability for each label. Objective: classification accuracy. Issue: many plateaus! How do we measure incremental progress? Log: use the log of the probability of the training labels as the objective [formula is an image]; a sketch follows.
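The score and probability formulas on the slide are images; the sketch below assumes the usual binary set-up where the score for y = +1 is w·f(x) and the score for y = -1 is -w·f(x), with the soft-max turning scores into probabilities and the log-likelihood giving a smooth objective.

import numpy as np

def label_probability(w, f, y):
    # Soft-max over the two scores z_{+1} = w.f(x) and z_{-1} = -w.f(x)
    # (assumed set-up; the slide's exact notation is an image).
    z_pos, z_neg = w @ f, -(w @ f)
    z_y = z_pos if y == 1 else z_neg
    return np.exp(z_y) / (np.exp(z_pos) + np.exp(z_neg))

def log_likelihood(w, data):
    # Sum of log-probabilities of the true labels: unlike classification
    # accuracy, this changes smoothly with w, so progress is measurable.
    return sum(np.log(label_probability(w, np.asarray(f), y)) for f, y in data)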

Two-Layer Neural Network and N-Layer Neural Network. [Figures: the same two-layer and N-layer architectures, with weights w11 .. w33 and w1, w2, w3.] Our Status. Our objective changes smoothly with changes in w and doesn't suffer from the same plateaus as the perceptron network. Challenge: how to find a good w? 1-D Optimization. We could evaluate the objective at the current w and at nearby points, then step in the best direction. Equivalently, we could evaluate the derivative, which tells us which direction to step in. A sketch of both appears after this paragraph.
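A sketch of the two equivalent 1-D strategies just described (g, the step size, and the finite-difference width h are placeholders of mine):

def best_neighbor_step(g, w, step=0.1):
    # Evaluate g at w and at its two neighbors, move to whichever is best
    # (here we maximize g; flip to min() to minimize a loss).
    return max([w - step, w, w + step], key=g)

def numerical_derivative(g, w, h=1e-5):
    # Central finite-difference estimate of dg/dw: its sign tells us
    # which direction to step in, its magnitude how steep g is there.
    return (g(w + h) - g(w - h)) / (2 * h)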

2-D Optimization. [Contour-plot figures; sources: Thomas Jungblut's blog and Mathworks.] Steepest Descent. Idea: start somewhere; repeat: take a step in the steepest descent direction. Steepest Direction. Generally, the steepest direction is the direction of the gradient [formula is an image]. Optimization Procedure 1: Gradient Descent. Init w; for i = 1, 2, ..., step against the gradient, scaled by the learning rate, a tweaking parameter that needs to be chosen carefully. How? Try multiple choices; a crude rule of thumb is that each update should change the weights by about 0.1 to 1%. A code sketch follows.
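Optimization Procedure 1 as code, a minimal sketch assuming grad_g computes the gradient of the objective being minimized and that alpha, the learning rate, has been tuned by trying multiple values:

import numpy as np

def gradient_descent(grad_g, w_init, alpha=0.01, iters=1000):
    # Init: start from w_init.
    w = np.array(w_init, dtype=float)
    # For i = 1, 2, ...: take a step in the steepest-descent direction,
    # i.e. against the gradient, scaled by the learning rate alpha.
    for _ in range(iters):
        w = w - alpha * grad_g(w)
    return w

# Example: minimizing g(w) = ||w||^2, whose gradient is 2w, drives w to ~0.
print(gradient_descent(lambda w: 2 * w, [3.0, -2.0]))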

Try Many Learning Rates. Suppose the loss function is steep vertically but shallow horizontally. Q: What is the trajectory along which we converge towards the minimum with gradient descent (or SGD)? A: Very slow progress along the flat direction, jitter along the steep one. [Contour-plot figures; slides from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016.]

Optimization Procedure 2: Momentum. Suppose the loss function is steep vertically but shallow horizontally. Gradient descent: init w; for i = 1, 2, ..., step against the gradient. Momentum: init w and a velocity; for i = 1, 2, ..., update the velocity from the gradient and move by the velocity [the update equations are images; a sketch appears after the Remaining Pieces list below]. Q: What is the trajectory along which we converge towards the minimum with momentum? Physical interpretation: a ball rolling down the loss function, plus friction (the mu coefficient). mu is usually ~0.5, 0.9, or 0.99, sometimes annealed over time, e.g. from 0.5 to 0.99. [Slides from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 6, 25 Jan 2016.]

Some More Advanced Optimization Methods*: Nesterov momentum update, AdaGrad, RMSProp, ADAM, L-BFGS.

Remaining Pieces: optimizing machine learning objectives (stochastic descent, mini-batches), improving generalization (drop-out), activation functions, initialization, renormalization, and computing the gradient (backprop, gradient checking).
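A sketch of the momentum update described above, next to plain gradient descent; the slide's exact update equations are images, so this follows the standard formulation with a velocity v and friction coefficient mu (typically 0.5, 0.9, or 0.99):

import numpy as np

def momentum_descent(grad_g, w_init, alpha=0.01, mu=0.9, iters=1000):
    # The velocity v accumulates past gradients, decayed by the friction
    # coefficient mu: this keeps progress going along the shallow direction
    # and damps the jitter along the steep one.
    w = np.array(w_init, dtype=float)
    v = np.zeros_like(w)
    for _ in range(iters):
        v = mu * v - alpha * grad_g(w)   # update the velocity
        w = w + v                        # move by the velocity
    return w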