MATH 567: Mathematical Techniques in Data Science Lab 8


MATH 567: Mathematical Techniques in Data Science Lab 8
Dominique Guillot
Department of Mathematical Sciences, University of Delaware
April 11, 2017

Recall

We have:
$$a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)$$
$$a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)$$
$$a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)$$
$$h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1).$$

Recall (cont.)

Vector form:
$$z^{(2)} = W^{(1)} x + b^{(1)}, \qquad a^{(2)} = f(z^{(2)}),$$
$$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \qquad h_{W,b}(x) = a^{(3)} = f(z^{(3)}).$$
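
As an illustration, here is a minimal R sketch of this forward pass for the 3-3-1 network above. It is not the h2o implementation used later in the lab; f is taken to be the sigmoid, and the weights and biases are small random values chosen only for the example.

# Minimal forward-pass sketch (illustrative values, sigmoid activation)
f <- function(z) 1 / (1 + exp(-z))           # sigmoid

set.seed(1)
W1 <- matrix(rnorm(9, sd = 0.01), 3, 3)      # W^(1): layer 1 -> layer 2
b1 <- rep(0, 3)                              # b^(1)
W2 <- matrix(rnorm(3, sd = 0.01), 1, 3)      # W^(2): layer 2 -> layer 3
b2 <- 0                                      # b^(2)

forward <- function(x) {
  z2 <- W1 %*% x + b1                        # z^(2) = W^(1) x + b^(1)
  a2 <- f(z2)                                # a^(2) = f(z^(2))
  z3 <- W2 %*% a2 + b2                       # z^(3) = W^(2) a^(2) + b^(2)
  f(z3)                                      # h_{W,b}(x) = a^(3) = f(z^(3))
}

forward(c(1, 0, 1))                          # output of the network for one input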

Training neural networks

Suppose we have:
A neural network with $s_l$ neurons in layer $l$ ($l = 1, \dots, n_l$).
Observations $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \in \mathbb{R}^{s_1} \times \mathbb{R}^{s_{n_l}}$.

We would like to choose $W^{(l)}$ and $b^{(l)}$ in some optimal way for all $l$.

Let
$$J(W, b; x, y) := \tfrac{1}{2} \|h_{W,b}(x) - y\|_2^2 \qquad \text{(squared error for one sample)}.$$

Define
$$J(W, b) := \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \bigl(W^{(l)}_{ji}\bigr)^2$$
(average squared error with a ridge penalty).

Note: The ridge penalty prevents overfitting. We do not penalize the bias terms $b^{(l)}$.
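
To make the objective concrete, here is a minimal R sketch that evaluates $J(W, b)$ for the small network above on a handful of made-up samples; the weights, data, and $\lambda$ are illustrative only.

# Regularized objective J(W, b): mean squared error + ridge penalty on the weights
f <- function(z) 1 / (1 + exp(-z))
set.seed(1)
W1 <- matrix(rnorm(9, sd = 0.01), 3, 3); b1 <- rep(0, 3)
W2 <- matrix(rnorm(3, sd = 0.01), 1, 3); b2 <- 0
X <- matrix(runif(3 * 5), nrow = 3)          # m = 5 samples, one per column
Y <- runif(5)                                # corresponding targets
lambda <- 1e-4

h <- function(x) f(W2 %*% f(W1 %*% x + b1) + b2)                  # h_{W,b}(x)
per_sample <- sapply(1:ncol(X), function(i) 0.5 * sum((h(X[, i]) - Y[i])^2))
J <- mean(per_sample) + (lambda / 2) * (sum(W1^2) + sum(W2^2))    # biases not penalized
J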

Some remarks

Can use other loss functions (e.g., for classification).
Can use other penalties (e.g., $\ell_1$, elastic net, etc.).
In classification problems, we choose the labels $y \in \{0, 1\}$ (if working with the sigmoid) or $y \in \{-1, 1\}$ (if working with tanh). For regression problems, we scale the output so that $y \in [0, 1]$ (if working with the sigmoid) or $y \in [-1, 1]$ (if working with tanh).
We can use gradient descent to minimize $J(W, b)$. Note that since the function $J(W, b)$ is non-convex, we may only find a local minimum.
We need an initial choice for $W^{(l)}_{ij}$ and $b^{(l)}_i$. If we initialize all the parameters to 0, then the parameters remain constant over the layers because of the symmetry of the problem. As a result, we initialize the parameters to small random values (say, using $N(0, \epsilon^2)$ with $\epsilon = 0.01$).

Gradient descent and the backpropagation algorithm

We update the parameters using gradient descent as follows:
$$W^{(l)}_{ij} \leftarrow W^{(l)}_{ij} - \alpha \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b), \qquad b^{(l)}_i \leftarrow b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W, b).$$
Here $\alpha > 0$ is a parameter (the learning rate).

The partial derivatives can be cleverly computed using the chain rule to avoid repeating calculations (backpropagation algorithm).
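
For concreteness, here is a minimal R sketch of one such update for the 3-3-1 network and squared-error loss above (sigmoid $f$, one training sample). It only illustrates the chain-rule computation; it is not the h2o implementation, and all numerical values are made up.

# One gradient-descent step computed by backpropagation (illustrative values)
f <- function(z) 1 / (1 + exp(-z))           # sigmoid, with f'(z) = f(z) (1 - f(z))
set.seed(1)
W1 <- matrix(rnorm(9, sd = 0.01), 3, 3); b1 <- rep(0, 3)
W2 <- matrix(rnorm(3, sd = 0.01), 1, 3); b2 <- 0
x <- c(1, 0, 1); y <- 1                      # one training sample
alpha <- 0.5; lambda <- 1e-4                 # learning rate and ridge parameter

# Forward pass
z2 <- W1 %*% x + b1; a2 <- f(z2)
z3 <- W2 %*% a2 + b2; a3 <- f(z3)

# Backward pass: propagate the error using the chain rule
delta3 <- -(y - a3) * a3 * (1 - a3)          # error at the output layer
delta2 <- (t(W2) %*% delta3) * a2 * (1 - a2) # error at the hidden layer

# Parameter updates (ridge penalty applied to the weights only)
W2 <- W2 - alpha * (delta3 %*% t(a2) + lambda * W2)
b2 <- b2 - alpha * as.numeric(delta3)
W1 <- W1 - alpha * (delta2 %*% t(x) + lambda * W1)
b1 <- b1 - alpha * as.numeric(delta2)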

Sparse neural networks

Sparse networks can be built by:
Penalizing coefficients (e.g., using an $\ell_1$ penalty).
Dropping some of the connections at random (dropout); Srivastava et al., JMLR 15 (2014).

Useful to prevent overfitting.

Recent work: one-shot learners can be used to train models with a smaller sample size.

Autoencoders

An autoencoder learns the identity function:
Input: unlabeled data.
Output = input.

Idea: limit the number of hidden units to discover structure in the data. Learn a compressed representation of the input.

Source: UFLDL tutorial.
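
As a sketch of how this looks in the package used later in this lab, h2o.deeplearning can be fit as an autoencoder by reconstructing the inputs themselves. The unlabeled frame, the 5-unit bottleneck, and the use of h2o.deepfeatures to pull out the compressed representation are assumptions for the example, not part of the lab.

# Sketch: an autoencoder in h2o (output = input, small hidden layer)
library(h2o)
h2o.init(nthreads = -1)
unlabeled <- as.h2o(matrix(runif(100 * 10), 100, 10))   # made-up unlabeled data

ae <- h2o.deeplearning(x = names(unlabeled),
                       training_frame = unlabeled,
                       autoencoder = TRUE,              # learn the identity function
                       activation = "Tanh",
                       hidden = c(5),                   # bottleneck: compressed representation
                       epochs = 50)

codes <- h2o.deepfeatures(ae, unlabeled, layer = 1)     # the learned features a^(2)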

Example (UFLDL)

Train an autoencoder on 10 × 10 images with one hidden layer.

Each hidden unit computes:
$$a^{(2)}_i = f\left(\sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i\right).$$

Think of $a^{(2)}_i$ as some non-linear feature of the input $x$.

Problem: Find the $x$ that maximally activates $a^{(2)}_i$ over $\|x\|_2 \le 1$.

Claim:
$$x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} \bigl(W^{(1)}_{ij}\bigr)^2}}.$$
(Hint: use the Cauchy–Schwarz inequality.)

We can now display the image maximizing $a^{(2)}_i$ for each $i$.
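
The hint can be unpacked as follows (this short derivation fills in the Cauchy–Schwarz step, which is only hinted at on the slide). Since $f$ is increasing, maximizing $a^{(2)}_i$ over $\|x\|_2 \le 1$ amounts to maximizing the linear form $\sum_{j=1}^{100} W^{(1)}_{ij} x_j$. By Cauchy–Schwarz,
$$\sum_{j=1}^{100} W^{(1)}_{ij} x_j \;\le\; \Bigl(\sum_{j=1}^{100} (W^{(1)}_{ij})^2\Bigr)^{1/2} \|x\|_2 \;\le\; \Bigl(\sum_{j=1}^{100} (W^{(1)}_{ij})^2\Bigr)^{1/2},$$
with equality when $x$ is proportional to the weight vector $(W^{(1)}_{i1}, \dots, W^{(1)}_{i,100})$ and $\|x\|_2 = 1$, i.e., when $x_j = W^{(1)}_{ij} / \sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}$, as claimed.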

Example (cont.)

100 hidden units on 10 × 10 pixel inputs: the different hidden units have learned to detect edges at different positions and orientations in the image.

Using convolutions

Idea: Certain signals are stationary, i.e., their statistical properties do not change in space or time. For example, images often have similar statistical properties in different regions of space. That suggests that the features we learn at one part of an image can also be applied to other parts of the image. We can convolve the learned features with the larger image.

Example: 96 × 96 image.
Learn features on small 8 × 8 patches sampled randomly (e.g., using a sparse autoencoder).
Run the trained model through all 8 × 8 patches of the image to get the feature activations.

Source: UFLDL tutorial.
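
The sketch below illustrates this convolution step in plain R (it is not the UFLDL code): one learned 8 × 8 feature, represented by hypothetical weights W and bias b, is slid over every 8 × 8 patch of a random 96 × 96 "image", producing an 89 × 89 map of feature activations.

# Convolving one learned 8 x 8 feature with a 96 x 96 image (illustrative values)
sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(1)
img <- matrix(runif(96 * 96), 96, 96)        # stand-in for a 96 x 96 image
W   <- matrix(rnorm(64, sd = 0.1), 8, 8)     # weights of one hidden unit (8 x 8 patch)
b   <- 0.1                                   # its bias

n <- 96 - 8 + 1                              # 89 valid patch positions per dimension
conv <- matrix(0, n, n)
for (i in 1:n) {
  for (j in 1:n) {
    patch <- img[i:(i + 7), j:(j + 7)]       # the 8 x 8 patch at position (i, j)
    conv[i, j] <- sigmoid(sum(W * patch) + b)  # feature activation on this patch
  }
}
dim(conv)                                    # 89 x 89 feature map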

Pooling features

One can also pool the features obtained via convolution. For example, to describe a large image, one natural approach is to aggregate statistics of these features at various locations, e.g., compute the mean, max, etc. over different regions.

Pooling can lead to more robust features, and to invariant features. For example, if the pooling regions are contiguous, then the pooling units will be translation invariant, i.e., they won't change much if objects in the image undergo a (small) translation.
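
Continuing the hypothetical example above, here is a minimal R sketch of mean pooling over non-overlapping regions of a convolved feature map; the map, the block size, and the helper pool_mean are made up for illustration (replace mean by max for max pooling).

# Mean pooling over non-overlapping blocks of a feature map (illustrative)
pool_mean <- function(m, size) {
  nr <- nrow(m) %/% size
  nc <- ncol(m) %/% size
  out <- matrix(0, nr, nc)
  for (i in 1:nr) {
    for (j in 1:nc) {
      block <- m[((i - 1) * size + 1):(i * size),
                 ((j - 1) * size + 1):(j * size)]
      out[i, j] <- mean(block)               # aggregate statistic over the region
    }
  }
  out
}

feat <- matrix(runif(88 * 88), 88, 88)       # stand-in for a convolved feature map
pooled <- pool_mean(feat, 22)                # 4 x 4 grid of pooled features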

R

We will use the package h2o to train neural networks with R. To get you started, we will construct a neural network with one hidden layer containing 2 neurons to learn the XOR function:

x1 x2 | XOR
 0  0 |  0
 0  1 |  1
 1  0 |  1
 1  1 |  0

# Initialize h2o
library(h2o)
h2o.init(nthreads = -1, max_mem_size = "2g")
h2o.removeAll()   # in case the cluster was already running

# Construct the XOR function
X = t(matrix(c(0, 0, 0, 1, 1, 0, 1, 1), 2, 4))
y = matrix(c(-1, 1, 1, -1), 4)
train = as.h2o(cbind(X, y))

R (cont.)

Training the model:

# Train the model
model <- h2o.deeplearning(x = names(train)[1:2],
                          y = names(train)[3],
                          training_frame = train,
                          activation = "Tanh",
                          hidden = c(2),
                          input_dropout_ratio = 0.0,
                          l1 = 0,
                          epochs = 10000)

# Test the model
h2o.predict(model, train)

Some options you may want to use when building more complicated models for data:
activation = "RectifierWithDropout"
input_dropout_ratio = 0.2
l1 = 1e-5
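
For instance, those options might be combined as in the sketch below when fitting a larger model; the two hidden layers of 200 units, the 50 epochs, and the reuse of the frame train as a placeholder are assumptions for illustration, not part of the lab.

# Hypothetical larger model using the suggested options
x_cols <- names(train)[1:2]    # predictor columns (placeholder)
y_col  <- names(train)[3]      # response column (placeholder)
model2 <- h2o.deeplearning(x = x_cols,
                           y = y_col,
                           training_frame = train,
                           activation = "RectifierWithDropout",
                           hidden = c(200, 200),
                           input_dropout_ratio = 0.2,
                           l1 = 1e-5,
                           epochs = 50)
h2o.predict(model2, train)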