Deep Learning. Boyang Albert Li, Jie Jay Tan

Transcription:

Deep Learning. Boyang Albert Li, Jie Jay Tan

An Unrelated Video: a bicycle controller learned using NEAT (Stanley)

What do you mean, deep? Shallow: Hidden Markov models, ANNs with one hidden layer, manually selected and designed features. Deep: stacked Restricted Boltzmann Machines, ANNs with multiple hidden layers, learning complex features.

Algorithms of Deep Learning: Recurrent Neural Networks, Stacked Autoencoders (i.e. deep neural networks), Stacked Restricted Boltzmann Machines (i.e. deep belief networks), Convolutional Deep Belief Networks, and a growing list.

But What is Wrong with Shallow? It needs more nodes / computing units and weights [Bengio, Y., et al. (2007). Greedy layer-wise training of deep networks]: Boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representation) that are expressible by $O(\log d)$ layers of combinatorial logic with $O(d)$ elements in each layer require $O(2^d)$ elements when expressed with only 2 layers. Reliance on manually selected features vs. automatically learning the features. Disentangling interacting factors, creating invariant features (we will come back to that).

Disentangling factors

Is the brain deep, too? http://thebrain.mcgill.ca/flash/a/a_02/a_02_cr/a_02_cr_vis/a_02_cr_vis.html Eric R. Kandel. (2012) The Age of Insight: The Quest to Understand the Unconscious in Art, Mind and Brain from Vienna 1900 to the Present

A general algorithm for the brain? One part of the brain can learn the function of another part. If the visual input is sent to the auditory cortex of a newborn ferret, the "auditory" cells learn to do vision (Sharma, Angelucci, and Sur. Nature 2000). People blinded at a young age can hear better, possibly because their brain can still adapt (Gougoux et al. Nature 2004). Different regions of the brain look similar.

Feature Learning vs. Deep Neural Network: pixels → edges → object parts → object models

Artificial Neural Networks: $y = h_W(x)$, mapping input $x$ to output $y$ through an Input Layer, a Hidden Layer, and an Output Layer.

Backpropagation. Minimize $J(w) = \tfrac{1}{2}\big(h_w(x) - y\big)^2$. Gradient computation for a weight into the output layer: $\frac{\partial J(w)}{\partial w^{(2)}_{11}} = (a^{(3)} - y)\,\frac{\partial a^{(3)}}{\partial w^{(2)}_{11}} = (a^{(3)} - y)\, f'\Big(\sum_{j=1}^{4} w^{(2)}_{j1} a^{(2)}_j\Big)\, a^{(2)}_1$, where $a^{(3)} = h_w(x)$.

Backpropagation. Gradient computation for a weight in the first layer: $\frac{\partial J(w)}{\partial w^{(1)}_{11}} = (a^{(3)} - y)\,\frac{\partial a^{(3)}}{\partial w^{(1)}_{11}} = (a^{(3)} - y)\, f'\Big(\sum_{j=1}^{4} w^{(2)}_{j1} a^{(2)}_j\Big)\, w^{(2)}_{11}\,\frac{\partial a^{(2)}_1}{\partial w^{(1)}_{11}}$.
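
A minimal NumPy sketch of the chain rule above (not from the slides; the layer sizes of 3 inputs, 4 hidden units, 1 output, the sigmoid activation, and the variable names are my assumptions): a forward pass followed by a backward pass that produces the gradients for all weights at once; dW2[0, 0] and dW1[0, 0] correspond to the two single-weight gradients written on the slides.

```python
import numpy as np

# Minimal backprop sketch for a 1-hidden-layer network (illustrative, not the authors' code).

def f(z):                      # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                # derivative of the sigmoid
    s = f(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=3)         # input vector
y = 1.0                        # scalar target
W1 = rng.normal(size=(4, 3))   # input -> hidden weights, w^(1)
W2 = rng.normal(size=(1, 4))   # hidden -> output weights, w^(2)

# Forward pass
z2 = W1 @ x                    # hidden pre-activations
a2 = f(z2)                     # a^(2)
z3 = W2 @ a2                   # output pre-activation
a3 = f(z3)                     # a^(3) = h_w(x)
J = 0.5 * (a3 - y) ** 2        # J(w) = 1/2 (h_w(x) - y)^2

# Backward pass (chain rule)
delta3 = (a3 - y) * f_prime(z3)         # error at the output unit
dW2 = np.outer(delta3, a2)              # dJ/dw^(2)_{ij} = delta^(3)_i * a^(2)_j
delta2 = (W2.T @ delta3) * f_prime(z2)  # error propagated to the hidden layer
dW1 = np.outer(delta2, x)               # dJ/dw^(1)_{ij} = delta^(2)_i * x_j
```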

More than one hidden layer? I thought of that, too. Didn't work! Lack of data and computational power. Weights initialization. Poor local minima. Diffusion of gradient. Overfitting: a multi-layer model is too powerful / complex.

Diffusion of Gradient: the error terms propagate backwards as $\delta^{(l)}_i = \Big(\sum_{j=1}^{s_{l+1}} w^{(l)}_{ji}\, \delta^{(l+1)}_j\Big) f'\big(z^{(l)}_i\big)$, and $\frac{\partial J(w)}{\partial w^{(l)}_{ij}} = a^{(l)}_j\, \delta^{(l+1)}_i$. Each additional layer multiplies the error signal by another set of weights and activation derivatives, so the gradient can shrink rapidly in the lower layers.
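
A tiny numeric illustration of the diffusion (my own sketch, not from the slides): with sigmoid units, $f'(z) \le 0.25$, so repeatedly applying the backward recursion above shrinks the error signal layer by layer. The unit count, layer count, and weight scale below are arbitrary choices.

```python
import numpy as np

# Propagate an error signal backwards through many layers of the delta recursion
# with sigmoid derivatives, and watch its norm shrink (gradient diffusion).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_units, n_layers = 50, 20

delta = rng.normal(size=n_units)                        # error signal at the top layer
for layer in range(n_layers):
    W = rng.normal(scale=0.1, size=(n_units, n_units))  # small random weights
    z = rng.normal(size=n_units)                        # pre-activations of this layer
    # delta^(l) = (W^T delta^(l+1)) * f'(z^(l)), with f'(z) = f(z)(1 - f(z)) <= 0.25
    delta = (W.T @ delta) * sigmoid(z) * (1.0 - sigmoid(z))
    print(f"layer {n_layers - layer:2d}: ||delta|| = {np.linalg.norm(delta):.3e}")
```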

Prevention of Overfitting: generative pre-training (a way to initialize the weights), learning p(x) or p(x, h) instead of p(y|x); early stopping; weight sharing; and many other methods.

Autoencoders: $\hat{x} = h_W(x)$, with weights trained by $w^* = \arg\min_w \sum_i \|x^{(i)} - \hat{x}^{(i)}\|^2$.
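
A minimal sketch of this objective (illustrative only; the tied weights, sigmoid activation, dimensions, and random data are my assumptions, not the slides'): encode, decode, and measure the squared reconstruction error that training would minimize.

```python
import numpy as np

# Autoencoder forward pass and reconstruction loss (illustrative sketch).
# Assumptions: 100-dimensional inputs, 25 hidden units, sigmoid activation, tied weights.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random(size=(500, 100))           # 500 training inputs x^(i)
W = rng.normal(scale=0.1, size=(25, 100))
b1, b2 = np.zeros(25), np.zeros(100)

A = sigmoid(X @ W.T + b1)                 # hidden codes a^(2) = f(W x + b1)
X_hat = sigmoid(A @ W + b2)               # reconstructions x_hat = h_W(x) (tied weights)
loss = np.sum((X - X_hat) ** 2)           # sum_i ||x^(i) - x_hat^(i)||^2, minimized over W, b
print(f"reconstruction loss: {loss:.2f}")
```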

Sparse Autoencoder: $\hat{x} = h_W(x)$

Sparse Autoencoder: for a given input $x$, most of the hidden activations $a^{(2)}_1, a^{(2)}_2, \dots, a^{(2)}_n$ should be (close to) 0.

Sparse Autoencoder: $\hat{x} = h_W(x)$, $w^* = \arg\min_w \sum_i \big(\|x^{(i)} - \hat{x}^{(i)}\|^2 + S(a^{(2)})\big)$, where $S$ is a sparsity penalty on the hidden activations.

Sparsity Regularizer. L0 norm: $S(a) = \sum_i I(a_i \neq 0)$. L1 norm: $S(a) = \sum_i |a_i|$. L2 norm: $S(a) = \sum_i a_i^2$.
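
A small sketch (my illustration) evaluating the three penalties above on an example activation vector. It also hints at why L1 is a common practical choice: like L0 it rewards exact zeros, but unlike L0 it is continuous and can be handled by gradient-based methods.

```python
import numpy as np

# Evaluate the three sparsity penalties from the slide on a sample activation vector.
a = np.array([0.0, 0.0, 0.8, 0.0, 0.1, 0.0])

l0 = np.sum(a != 0)        # L0 "norm": number of non-zero activations
l1 = np.sum(np.abs(a))     # L1 norm: sum of absolute values
l2 = np.sum(a ** 2)        # L2 penalty: sum of squares
print(f"L0 = {l0}, L1 = {l1:.2f}, L2 = {l2:.2f}")
```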

L1 vs. L2 Regularizer

Efficient sparse coding: Lee et al. (2006) Efficient sparse coding algorithms. NIPS.

Dimension Reduction vs. Sparsity

Visualize a Trained Autoencoder. Suppose the autoencoder is trained on 10 × 10 images: $a^{(2)}_i = f\big(\sum_{j=1}^{100} W_{ij} x_j\big)$.

Visualize a Trained Autoencoder. What image will maximally activate $a^{(2)}_i$? Less formally, what is the feature that hidden unit $i$ is looking for? Solve $\max_x f\big(\sum_{j=1}^{100} W_{ij} x_j\big)$ s.t. $\sum_{j=1}^{100} x_j^2 \le 1$; the solution is $x_j = \frac{W_{ij}}{\sqrt{\sum_{j=1}^{100} W_{ij}^2}}$.
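
A short sketch of that visualization (illustrative; assumes a weight matrix W of shape (n_hidden, 100) from an autoencoder trained on 10 × 10 images): each hidden unit's maximally activating image is its normalized weight row, reshaped to 10 × 10.

```python
import numpy as np

# Visualize what each hidden unit of a trained autoencoder responds to (sketch).

def maximally_activating_images(W):
    # x_j = W_ij / sqrt(sum_j W_ij^2): the norm-constrained input that maximizes a^(2)_i
    norms = np.sqrt(np.sum(W ** 2, axis=1, keepdims=True))
    X = W / norms
    return X.reshape(-1, 10, 10)      # one 10x10 "feature image" per hidden unit

# Example with random weights standing in for a trained autoencoder:
rng = np.random.default_rng(0)
W = rng.normal(size=(25, 100))
images = maximally_activating_images(W)
print(images.shape)                   # (25, 10, 10)
```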

Visualize a Trained Autoencoder

Train a Deep Autoencoder: $x \to \hat{x}$, built up one layer at a time.

Train a Deep Autoencoder: Fine Tuning ($x \to \hat{x}$).

Train a Deep Autoencoder: $x \to$ Feature Vector.
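
The stacking procedure in the last few slides, as a rough sketch (my own illustration, not the authors' implementation; the tied-weight autoencoder, layer sizes, and training loop are assumptions): each layer is pre-trained as an autoencoder on the codes produced by the layer below, and the stacked encoders then map x to a feature vector.

```python
import numpy as np

# Greedy layer-wise pre-training of a stacked autoencoder (illustrative sketch only).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=50, lr=0.1, seed=0):
    """Train a one-hidden-layer, tied-weight autoencoder by gradient descent."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))
    for _ in range(epochs):
        A = sigmoid(X @ W.T)                            # encode
        X_hat = sigmoid(A @ W)                          # decode with the same (tied) weights
        dZ2 = 2 * (X_hat - X) * X_hat * (1 - X_hat)     # grad w.r.t. decoder pre-activations
        dZ1 = (dZ2 @ W.T) * A * (1 - A)                 # grad w.r.t. encoder pre-activations
        W -= lr * (A.T @ dZ2 + dZ1.T @ X) / len(X)      # both uses of W contribute
    return W

def greedy_pretrain(X, layer_sizes):
    """Stack autoencoders: each layer is trained on the codes of the layer below."""
    weights, codes = [], X
    for n_hidden in layer_sizes:
        W = train_autoencoder(codes, n_hidden)
        weights.append(W)
        codes = sigmoid(codes @ W.T)                    # these codes feed the next layer
    return weights, codes                               # codes are the final feature vectors

rng = np.random.default_rng(0)
X = rng.random(size=(200, 100))                         # stand-in for 10x10 image data
weights, features = greedy_pretrain(X, layer_sizes=[64, 32, 16])
print(features.shape)                                   # (200, 16)
```

Fine-tuning would then adjust all the stacked weights jointly, either for reconstruction or with a classifier on top, as in the next slide.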

Train an Image Classifier: $x \to$ Image Label (car or people).

Visualize a Trained Autoencoder

Learning Independent Features? Le, Zou, Yeung, and Ng, CVPR 2011. Invariant features, disentangling factors. Introducing independence to improve the results.

Results

Recurrent Neural Networks. Sutskever, Martens, Hinton. 2011. Generating Text with Recurrent Neural Networks. ICML.

RNN to predict characters: 1500 hidden units; input character: 1 of 86; softmax predicted distribution for the next character. It is a lot easier to predict 86 characters than 100,000 words.
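
A minimal sketch of one step of such a character-level RNN (illustrative; the weight names, tanh nonlinearity, and initialization follow common practice rather than the slide's exact architecture): the hidden state is updated from the previous state and a 1-of-86 input, and a softmax gives the distribution over the next character.

```python
import numpy as np

# One step of a character-level RNN (illustrative sketch, not the slide's exact model).
n_chars, n_hidden = 86, 1500

rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.01, size=(n_hidden, n_hidden))  # hidden -> hidden
W_xh = rng.normal(scale=0.01, size=(n_hidden, n_chars))   # input char -> hidden
W_hy = rng.normal(scale=0.01, size=(n_chars, n_hidden))   # hidden -> output logits

def rnn_step(h_prev, char_index):
    x = np.zeros(n_chars)
    x[char_index] = 1.0                                   # 1-of-86 encoding
    h = np.tanh(W_hh @ h_prev + W_xh @ x)                 # new hidden state
    logits = W_hy @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                                          # softmax: distribution over next char
    return h, p

h = np.zeros(n_hidden)
h, p_next = rnn_step(h, char_index=3)
print(p_next.shape, p_next.sum())                         # (86,) 1.0
```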

A sub-tree in the tree of all character strings. There are exponentially many nodes in the tree of all character strings of length N (tree diagram: ...fix branching to ...fixi, ...fixe, ...fixin). In an RNN, each node is a hidden state vector. The next character must transform this to a new node. If the nodes are implemented as hidden states in an RNN, different nodes can share structure because they use distributed representations. The next hidden representation needs to depend on the conjunction of the current character and the current hidden representation.

Multiplicative connections. Instead of using the inputs to the recurrent net to provide additive extra input to the hidden units, we could use the current input character to choose the whole hidden-to-hidden weight matrix. But this requires 86 × 1500 × 1500 parameters. This could make the net overfit. Can we achieve the same kind of multiplicative interaction using fewer parameters? We want a different transition matrix for each of the 86 characters, but we want these 86 character-specific weight matrices to share parameters (the characters 9 and 8 should have similar matrices).

Using factors to implement multiplicative interactions. We can get groups a and b to interact multiplicatively by using factors. Each factor f first computes a weighted sum for each of its input groups. Then it sends the product of the weighted sums to its output group: $c_f = \big(b^\top w_f\big)\big(a^\top u_f\big)\, v_f$, where $c_f$ is the vector of inputs to group c from factor f, $b^\top w_f$ is the scalar input to f from group b, and $a^\top u_f$ is the scalar input to f from group a.
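
A small sketch of that factored computation (my illustration; the sizes and names are assumptions): the total input to group c is the sum over factors of the product of two scalar projections, scaled by each factor's output vector, using far fewer parameters than one full matrix per character.

```python
import numpy as np

# Factored multiplicative interaction (sketch):
# input to group c = sum_f (b . w_f) * (a . u_f) * v_f

n_a, n_b, n_c, n_factors = 1500, 86, 1500, 512

rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(n_factors, n_a))  # u_f rows: project group a
W = rng.normal(scale=0.01, size=(n_factors, n_b))  # w_f rows: project group b
V = rng.normal(scale=0.01, size=(n_factors, n_c))  # v_f rows: outputs to group c

def factored_input_to_c(a, b):
    scalars = (U @ a) * (W @ b)      # per-factor product of the two weighted sums
    return V.T @ scalars             # sum_f scalars_f * v_f

a = rng.normal(size=n_a)             # e.g. current hidden state
b = np.zeros(n_b); b[5] = 1.0        # e.g. 1-of-86 current character
c = factored_input_to_c(a, b)
print(c.shape)                       # (1500,)
```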

Sample text generated by the RNN: He was elected President during the Revolutionary War and forgave Opus Paul at Rome. The regime of his crew of England, is now Arab women's icons in and the demons that use something between the characters' sisters in lower coil trains were always operated on the line of the ephemerable street, respectively, the graphic or other facility for deformation of a given proportion of large segments at RTUS). The B every chord was a "strongly cold internal palette pour even the white blade."

The meaning of life is 42? The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger.

Is RNN deep enough? This deep structure provides memory, not hierarchical processing. Adding hierarchical processing: Pascanu, Gulcehre, Cho, and Bengio (2013).

Why Unsupervised Pre-training Works (from Bengio's talk). Optimization Hypothesis: unsupervised training initializes the weights in the vicinity of better minima than random initialization can reach. Regularization Hypothesis (preventing over-fitting): the unsupervised pre-training dataset is larger, so the features extracted from it are more general and have better discriminative power.

Why Unsupervised Pre-training Works. Bengio: learning P(x) or P(x, h) helps you with P(y|x). Structures and features that can generate the inputs (whether or not a probabilistic formulation is used) also happen to be useful for your supervised task. This requires P(x) and P(y|x) to be related, i.e. similar-looking x produce similar y. This is probably more true for vision / audio than for text.

Conclusion. Motivation for deep learning; backpropagation; autoencoders and sparsity; generative, layer-wise pre-training (stacked autoencoders); recurrent neural networks; speculation on why these things work.