Lecture 13: Introduction to Neural Networks

Instructor: Aditya Bhaskara
Scribe: Dietrich Geisler
CS 5966/6966: Theory of Machine Learning
March 8th, 2017

Abstract. This lecture introduces feed-forward neural networks: how they are defined, what functions small-depth networks can represent, and how hard it is to train them from examples.

1 Recap: Boosting

In the previous lecture, we discussed boosting. Boosting is the principle that, given a weak learner for a class, it is possible to obtain a strong learner. More formally, given black-box access to a weak learning algorithm, it is possible to produce a strong learning algorithm by combining the weak learner's outputs.

2 Formalizing Feed-Forward Neural Networks

The first type of neural network we will explore is the feed-forward neural network. A feed-forward neural network takes a collection of inputs, passes them through a series of hidden layers (HLs) composed of neurons, and uses this mapping to produce an output. The output can be of any dimension and does not need to match the dimension of the input.

Each neuron in the network is an entity connecting inputs to other neurons (or to the output); it essentially functions as a map from a given set of input values to a single output value. For a given neuron, each incoming input is multiplied by a weight $w_i$, and these weights are adjusted as new data arrives, which is how learning happens. The neuron also has a function $f$ by which its output value is produced. That is, a neuron with inputs $x_1, x_2, \ldots, x_k$, weights $w_1, w_2, \ldots, w_k$, and function $f$ produces the output

(1)    $f(w_1 x_1 + w_2 x_2 + \cdots + w_k x_k)$
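To make Eq. (1) concrete, here is a minimal Python sketch of a single neuron, assuming a threshold activation for concreteness (the choice of the function $f$ is discussed in the next section); the weights and inputs are made-up illustrations.

    import numpy as np

    def neuron(x, w, f):
        # A single neuron as in Eq. (1): apply f to the weighted sum w_1 x_1 + ... + w_k x_k.
        return f(np.dot(w, x))

    def threshold(t):
        # Threshold activation: output 1 once the weighted sum reaches t, else 0.
        return lambda z: 1.0 if z >= t else 0.0

    # Example (made up): a neuron that fires when at least 2 of its 3 inputs are on.
    print(neuron(np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0]), threshold(2.0)))  # 1.0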

The choice of $f$ is arbitrary; however, we generally require that $f : \mathbb{R} \to [0, 1]$ (each neuron applies $f$ to the weighted sum of its inputs) and we expect $f$ to be nonlinear. Common choices for $f$ include the threshold function,

$f(z) = \mathbb{1}[\, z \geq t \,]$ for some threshold $t$,

and the sigmoid function,

$f(z) = \dfrac{1}{1 + e^{-z}}.$

Other common choices for $f$ exist but will be considered later.

The final output of a neural network, after the initial inputs $x_1, x_2, \ldots, x_n$ have moved through each layer, is given by the following function, where each $f(w \cdot x)$ is a neuron as in Eq. (1):

(2)    $f_{vt}(x_1, x_2, \ldots, x_n) = f\big(w \cdot \big[\, f(w \cdot [\,\cdots\,]), \ldots, f(w \cdot [\,\cdots\,]) \,\big]\big),$

that is, each layer applies Eq. (1) to the vector of outputs of the layer below, ending with neurons applied directly to the inputs; different neurons have their own weight vectors, all written $w$ here for brevity.

This equation implies some useful properties of neural networks. First, since Eq. (2) is deterministic, the output of a feed-forward neural network is always deterministic; that is, an unchanged network always produces the same value on the same inputs. Second, the overall function computed by the network can be nonlinear, since each neuron applies a nonlinear function to a linear combination of its inputs. Finally, the nonlinearity of the function in Eq. (2) allows the network to model complicated patterns and interactions.

The power apparently granted by neural networks, however, raises two natural questions. First, what does $f_{vt}(x_1, \ldots, x_n)$ actually look like? Second, can we find a neural network that computes some $f_{vt}$ given enough training examples?
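Eq. (2) is just a nested application of Eq. (1), which the following sketch makes explicit: the network is described by a list of weight matrices, one per layer, and every neuron uses the sigmoid activation above. The layer sizes and weights below are made-up illustrations, not anything from the lecture.

    import numpy as np

    def sigmoid(z):
        # The sigmoid activation f(z) = 1 / (1 + e^{-z}).
        return 1.0 / (1.0 + np.exp(-z))

    def feed_forward(x, weight_matrices, f=sigmoid):
        # Apply Eq. (1) layer by layer, as in Eq. (2): each layer's output
        # is f(W a), where a is the vector of outputs of the layer below.
        a = np.asarray(x, dtype=float)
        for W in weight_matrices:
            a = f(W @ a)
        return a

    # Example: 3 inputs -> 2 hidden neurons -> 1 output (arbitrary weights).
    W1 = np.array([[0.5, -1.0, 2.0],
                   [1.5,  0.3, -0.7]])
    W2 = np.array([[1.0, -2.0]])
    print(feed_forward([1.0, 0.0, 1.0], [W1, W2]))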

3 Structure of $f_{vt}(x_1, \ldots, x_n)$

We will denote the number of layers of a neural network by $t \geq 1$ and observe that the number of hidden layers in such a network is equal to $t - 1$.

Let us first consider the simplest neural network, where $t = 1$. This is just a direct mapping from the inputs $x_1, x_2, \ldots, x_n$ to the output; in other words, $f_{vt} = f$, where $f$ is the function given in Eq. (1).

We now examine neural networks with $t = 2$, which have exactly one hidden layer. Using Eq. (1) and Eq. (2), we obtain the function

(3)    $f_{vt}(x_1, x_2, \ldots, x_n) = f\big(w_1 f_1(w_{11} x_1 + \cdots + w_{1n} x_n) + \cdots + w_m f_m(w_{m1} x_1 + \cdots + w_{mn} x_n)\big),$

where $m$ is the number of hidden neurons and $f_1, \ldots, f_m$ are neurons as in Eq. (1). This can be visualized as a network diagram with one hidden layer (the two equivalent diagrams drawn in lecture are omitted here).

The function in Eq. (3) can represent any convex shape as a conjunction of linear classifiers (the illustration from lecture is omitted here). More generally, however, [Cybenko 89] and [Kolmogorov 50s] tell us that any continuous function can be approximated to arbitrary accuracy by a two-layer neural network. Showing this result is fairly technical, so we will instead illustrate a related case: boolean functions over the domain $\{0,1\}^n$. Specifically, we will show that any boolean function $f : \{0,1\}^n \to \{0,1\}$ can be written as a two-layer threshold network.

The idea of the proof is as follows. Suppose that $n = 3$, so inputs are of the form $(x_1, x_2, x_3)$ and a total of $2^3 = 8$ inputs are possible. Since the only possible outputs are in $\{0, 1\}$, some number (potentially 0) of these inputs produce the value 1 and the remainder produce 0. We construct a neural network with one hidden neuron for each input that should map to 1, wired so that the neuron fires precisely on that input. The output is then 1 if at least one of these hidden neurons fires and 0 otherwise.

As an illustration, suppose that the inputs $(1, 0, 0)$ and $(0, 0, 1)$ produce the value 1 and all other inputs produce 0. We construct a network with two hidden threshold neurons, each outputting 1 if its condition holds and 0 otherwise:

$f_1 = \mathbb{1}\big[\, x_1 + (1 - x_2) + (1 - x_3) \geq 3 \,\big],$
$f_2 = \mathbb{1}\big[\, (1 - x_1) + (1 - x_2) + x_3 \geq 3 \,\big],$
$f_{vt} = \mathbb{1}\big[\, f_1 + f_2 \geq 1 \,\big].$

Then we have perfectly modeled our input function; a short sketch verifying this construction appears below.
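As a quick sanity check (not part of the lecture), the construction for this $n = 3$ example can be verified by brute force over all $2^3$ inputs:

    from itertools import product

    def f1(x1, x2, x3):
        # Fires exactly on the input (1, 0, 0).
        return 1 if x1 + (1 - x2) + (1 - x3) >= 3 else 0

    def f2(x1, x2, x3):
        # Fires exactly on the input (0, 0, 1).
        return 1 if (1 - x1) + (1 - x2) + x3 >= 3 else 0

    def f_vt(x1, x2, x3):
        # Output neuron: 1 if at least one hidden neuron fires.
        return 1 if f1(x1, x2, x3) + f2(x1, x2, x3) >= 1 else 0

    ones = {(1, 0, 0), (0, 0, 1)}  # the inputs that should map to 1
    assert all(f_vt(*x) == (1 if x in ones else 0) for x in product([0, 1], repeat=3))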

This argument does not depend on $n$, so we can perform a similar construction for any size of input layer. We can therefore conclude that, so long as we do not restrict the number of nodes in the hidden layer, we can perfectly model any boolean function with a two-layer neural network.

This raises the question, however, of what happens when we do limit the number of nodes in the hidden layer, as we would have to in the real world. As might be expected from the construction above, there are indeed boolean functions that cannot be represented by a two-layer neural network once we limit the number of neurons in the hidden layer. A formalization of this idea is given in the theorem below.

3.1 theorem. There exists a boolean function $f$ over $n$ bits such that any threshold network computing $f$ has size $(V + E) \geq 2^{\Omega(n)}$.

Proof. Before demonstrating this, we must bound the VC dimension of size-limited networks. Define $\text{smallnets}_d$ to be the set of all threshold nets with $V + E \leq d$.

3.2 theorem. $\mathrm{VC}(\text{smallnets}_d) \leq d \log d$.  (*)

(The resulting argument for Theorem 3.1 is non-constructive, since it proceeds by counting.)

Proof. The proof of this theorem is a bit technical and can be found in the textbook.

Now that we have established this bound, we may proceed with our original theorem. Recall the intuition that the VC dimension of a class roughly measures the number of distinct parameters, or degrees of freedom, available to it. Consider the case where $d \log d < 2^n$ (if instead $d \log d \geq 2^n$, then $d$ is already as large as the theorem claims). By (*), $\mathrm{VC}(\text{smallnets}_d) < 2^n$. But the VC dimension of a class $H$ is the size of the largest set that $H$ can shatter, so $\text{smallnets}_d$ cannot shatter the set $S$ of all $2^n$ inputs in $\{0,1\}^n$. Hence some labeling of $S$, that is, some boolean function on $n$ bits, is computed by no network in $\text{smallnets}_d$. Computing every boolean function on $n$ bits therefore requires $d \log d \geq 2^n$, which forces $d = V + E$ to be at least roughly $2^n / n = 2^{\Omega(n)}$.

Before moving to the next section, it is worth summarizing what we have discovered. Simple (small-depth) neural networks initially seem very powerful, and indeed allow representation of a variety of interesting functions. Real-world limitations, however, enforce bounds on the number of neurons and, therefore, on the complexity of the functions that can be represented. We conclude by noting that depth alone does not remove this limitation, though it can help. Consider the question: can every function computed by a depth-4 network of size $n$ also be computed by a depth-2 network of size, say, $n^2$?

4 Learning Power of Neural Networks

Recall our question from earlier: can we find a neural network that computes some $f_{vt}$ given enough training examples? More precisely, is training neural networks a hard problem and, if so, how hard is it to obtain arbitrary precision?

Suppose we fix a network architecture of depth 2, that is, a network with one hidden layer. Can we train this network?

More precisely, can we find the optimal neural network to model some function, given sample data? This problem is known as Empirical Risk Minimization (ERM). We know the structure of the network, namely that it has one hidden layer and a fixed number of neurons. Our problem, then, is to find values for the weights $w_i$ and thresholds $t_i$, for all $i$, that minimize the number of mistakes across the training set. This problem is generally very difficult, even for fairly simple networks such as the one given!

4.1 theorem. Even a network with just 3 neurons in the hidden layer is NP-hard to learn; that is, there are arrangements of training inputs for which finding the optimal network of this architecture is NP-hard. More generally, the ERM problem is NP-hard.

Proof. We will perform a reduction from the 3-coloring problem, which is known to be NP-complete. The 3-coloring problem is as follows: given a graph $G(V, E)$, determine whether it is possible to color every $v \in V$ with one of the colors red, green, and blue such that for every edge $e = (u, v) \in E$, the endpoints $u$ and $v$ receive different colors. In other words, the coloring partitions the vertices of the graph so that no edge joins two vertices in the same part.

To reduce 3-coloring to the ERM problem for a two-layer network with 3 hidden neurons, take a graph $G(V, E)$ and construct a set of training examples from it. In particular, it is possible to choose the training examples so that the optimal ERM value is 0 if and only if $G(V, E)$ is 3-colorable. As a hint, some examples are labeled 1 while the rest are labeled 0: we use two kinds of inputs, $(0, 0, \ldots, 1, \ldots, 0, 0) \mapsto 0$, where the 1 occurs at the $i$-th position (one example per vertex $i$), and $(0, 0, \ldots, 1, \ldots, 1, \ldots, 0, 0) \mapsto 1$, with 1s at the positions of the endpoints of a given edge $(u, v)$ (one example per edge). The hidden neurons then play the role of the three colors: fitting the vertex examples assigns each vertex to a neuron, and the edge examples force the endpoints of every edge to be assigned to different neurons, matching a proper coloring.
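Below is a minimal sketch of the training set used in this reduction, assuming vertices are numbered $0, \ldots, |V|-1$ and each input coordinate corresponds to a vertex; the function name and the example graph are hypothetical illustrations, not part of the lecture.

    def training_examples(n_vertices, edges):
        # Build the two kinds of labeled examples described above:
        # each vertex-indicator vector is labeled 0, and each edge vector
        # (1s at both endpoints) is labeled 1.
        def indicator(positions):
            x = [0] * n_vertices
            for p in positions:
                x[p] = 1
            return tuple(x)

        examples = [(indicator([v]), 0) for v in range(n_vertices)]
        examples += [(indicator([u, v]), 1) for (u, v) in edges]
        return examples

    # Example: a triangle on 3 vertices (which is 3-colorable).
    print(training_examples(3, [(0, 1), (1, 2), (0, 2)]))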