Machine Learning (CSE 446): Neural Networks


Machine Learning (CSE 446): Neural Networks
Noah Smith © 2017 University of Washington
nasmith@cs.washington.edu
November 6, 2017

Admin: No Wednesday office hours for Noah; no lecture Friday.

Classifiers We've Covered So Far

                        decision boundary?        difficult part of learning?
  decision trees        piecewise-axis-aligned    greedy split decisions
  K-nearest neighbors   possibly very complex     indexing training data
  perceptron            linear                    iterative optimization method required
  logistic regression   linear                    iterative optimization method required
  naïve Bayes           linear (see A4)           none

The next methods we'll cover permit nonlinear decision boundaries.

Inspiration from Neurons

[Image of a biological neuron from Wikimedia Commons.] Input signals come in through dendrites; the output signal passes out through the axon.

Neuron-Inspired Classifiers

[Diagram: inputs x[1], ..., x[d] are multiplied by weight parameters w[1], ..., w[d], summed together with a bias parameter b, and passed through an activation that decides whether to "fire"; the result is the output ŷ. A later build labels the activation as a function f and, for learning, compares the classifier output ŷ = f(x_n) against the correct output y_n with a loss L_n.]
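To make the diagram concrete, here is a minimal sketch (my own illustration, not part of the slides) of the single-neuron classifier above, assuming a hard sign-style "fire or not" activation; the names x, w, b follow the figure and the specific numbers are hypothetical:

```python
# A minimal sketch (my own illustration, not from the slides) of the
# neuron-inspired classifier: weighted inputs plus a bias, then a
# "fire or not" decision.  Here the activation is a hard sign.
import numpy as np

def neuron_predict(x, w, b):
    """Return +1 if the neuron 'fires' on input x, else -1."""
    activation = np.dot(w, x) + b
    return 1 if activation >= 0 else -1

# Hypothetical numbers, just to show the shapes (d = 3 here).
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.4, 0.1, -0.3])
b = 0.2
print(neuron_predict(x, w, b))  # prints +1 or -1
```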

Neuron-Inspired Classifiers

[Plot of the hyperbolic tangent over roughly −10 ≤ z ≤ 10, with values in (−1, 1).]

Hyperbolic tangent function: $\tanh(z) = \dfrac{e^z - e^{-z}}{e^z + e^{-z}}$.

Generalization: apply elementwise to a vector, so that $\tanh : \mathbb{R}^k \to (-1, 1)^k$.
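As a quick sanity check (not from the lecture), the snippet below applies tanh elementwise with NumPy, confirms it matches the definition above, and verifies that the outputs stay strictly inside (−1, 1):

```python
# A quick check (not part of the lecture) that NumPy's tanh matches the
# definition above and maps vectors elementwise into (-1, 1)^k.
import numpy as np

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
out = np.tanh(z)                                             # elementwise
manual = (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))
assert np.allclose(out, manual)
assert np.all((out > -1.0) & (out < 1.0))
print(out)
```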

Neuron-Inspired Classifiers

[Diagram: the input x_n feeds two hidden units, tanh(w_1 · x_n + b_1) and tanh(w_2 · x_n + b_2), which are combined with weights v_1 and v_2 into the activation that produces the output ŷ; the classifier output f is compared to the correct output y_n by the loss L_n. The next build collects the hidden units into matrix form, tanh(Wx_n + b), combined with the vector v.]

Two-Layer Neural Network

$$f(x) = \mathrm{sign}\left(\sum_{h=1}^{H} v_h \tanh(w_h \cdot x + b_h)\right) = \mathrm{sign}\left(v \cdot \tanh(Wx + b)\right)$$

Two-layer networks allow decision boundaries that are nonlinear.

It's fairly easy to show that XOR can be simulated (recall conjunction features from the practical issues lecture on 10/18); see the sketch below.

Theoretical result: any continuous function on a bounded region in $\mathbb{R}^d$ can be approximated arbitrarily well, with a finite number of hidden units.

The number of hidden units affects how complicated your decision boundary can be and how easily you will overfit.
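For instance, here is one hand-built sketch (my own construction, not from the course materials) of weights under which $f(x) = \mathrm{sign}(v \cdot \tanh(Wx + b))$ computes XOR: hidden unit 1 acts roughly like OR, hidden unit 2 like AND, and hidden unit 3 is always on, playing the role of an output bias.

```python
# A hand-built example (not from the slides) showing that the two-layer form
# f(x) = sign(v . tanh(W x + b)) can simulate XOR.  The weights are my own
# construction: unit 1 ~ OR, unit 2 ~ AND, unit 3 is a constant "bias" unit.
import numpy as np

W = np.array([[10.0, 10.0],    # ~OR of the two inputs
              [10.0, 10.0],    # ~AND of the two inputs
              [ 0.0,  0.0]])   # constant unit (bias carrier)
b = np.array([-5.0, -15.0, 10.0])
v = np.array([2.0, -2.0, -1.0])  # "OR and not AND" = XOR

def f(x):
    return np.sign(v @ np.tanh(W @ x + b))

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([float(x1), float(x2)])
        print(x1, x2, "->", int(f(x)))   # +1 exactly when x1 XOR x2
```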

Learning with a Two-Layer Network

Parameters: $W \in \mathbb{R}^{H \times d}$, $b \in \mathbb{R}^H$, and $v \in \mathbb{R}^H$.

If we choose a differentiable loss, then the whole function will be differentiable with respect to all parameters.

Because of the squashing function, which is not convex, the overall learning problem is not convex.

What does (stochastic) (sub)gradient descent do with non-convex functions? It finds a local minimum.

To calculate gradients, we need to use the chain rule from calculus.

Special name for (S)GD with chain rule invocations: backpropagation. A sketch of one such update appears below.
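As a rough illustration (my own sketch, not the course's implementation), one SGD-with-backpropagation update for this two-layer network could look like the following. The choice of logistic loss $\log(1 + e^{-y s})$ with labels $y \in \{-1, +1\}$, the learning rate, the hidden width H, and the XOR demo are all my assumptions; the slide only requires some differentiable loss.

```python
# A minimal SGD + backpropagation sketch (my own illustration, not the course
# code): a two-layer network s(x) = v . tanh(W x + b) trained with the
# differentiable logistic loss log(1 + exp(-y * s)) for labels y in {-1, +1}.
import numpy as np

rng = np.random.default_rng(0)

def init_params(d, H):
    # Small random weights; because the objective is non-convex, different
    # initializations can lead to different local minima.
    return (0.1 * rng.standard_normal((H, d)),   # W
            np.zeros(H),                         # b
            0.1 * rng.standard_normal(H))        # v

def sgd_step(x, y, W, b, v, lr=0.1):
    # Forward pass.
    a = W @ x + b          # pre-activations
    h = np.tanh(a)         # hidden units
    s = v @ h              # classifier score
    # Backward pass (chain rule).
    dL_ds = -y / (1.0 + np.exp(y * s))   # derivative of the logistic loss wrt s
    dL_dv = dL_ds * h
    dL_da = dL_ds * v * (1.0 - h ** 2)   # tanh'(a) = 1 - tanh(a)^2
    dL_dW = np.outer(dL_da, x)
    dL_db = dL_da
    # Stochastic gradient descent update.
    return W - lr * dL_dW, b - lr * dL_db, v - lr * dL_dv

# Tiny demo: learn XOR on {0,1}^2 with labels in {-1,+1}.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([-1.0, 1.0, 1.0, -1.0])
W, b, v = init_params(d=2, H=8)
for epoch in range(2000):
    for i in rng.permutation(len(X)):
        W, b, v = sgd_step(X[i], Y[i], W, b, v)
print(np.sign(np.tanh(X @ W.T + b) @ v))  # should (usually) match Y
```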