Do you like to be successful? Able to see the big picture


Do you like to be successful? Able to see the big picture 1

Are you able to recognise a scientific GEM? 2

How to recognise good work? Suggestions please. Item #1: the first of its kind. Item #2: it solves a problem. Item #3: the learning process, the effort put in, the value added. 3

All these works resulted in Nobel Prizes.
Enrico Fermi's seminal paper on weak interaction, 1933. Reason for rejection: it contained speculations too remote from reality to be of interest to the reader.
Hans Krebs' paper on the citric acid cycle, AKA the Krebs cycle, 1937. Reason for rejection: Nature had a huge publishing backlog.
Murray Gell-Mann's work on classifying the elementary particles, 1953. Reason for rejection: he did not use a name the editor liked for his discovery.
The invention of the radioimmunoassay, 1955. Reason for rejection: reviewers were skeptical that humans could make antibodies small enough to bind to things like insulin.
The discovery of quasicrystals, 1984. Reason for rejection: it was rejected on the grounds that it would not interest physicists.
Paper outlining nuclear magnetic resonance (NMR) spectroscopy, 1966. Reason for rejection: rejected twice by the J. Chem. Phys., to be finally accepted and published in the Review of Scientific Instruments. 4

Is your idea so radical that your professor thinks you are out of your mind? Do you discover nature? Are you trying to break conventional wisdom? If you are doing deep learning for healthcare, are you listening to good doctors? Engineers building a tennis racket must ask the tennis players. 5

Generative Adversarial Networks. "This (GAN), ... is the most interesting idea in the last 10 years in ML, in my opinion." Yann LeCun 6

Generative 7

Generative: sampled noise input → Generator Network → generated image. https://www.linkedin.com/pulse/unsupervised-representation-learning-deep-generative-diego 8

Adversarial 9

When the generator network is not trained well, the Generator Network produces nonsense instead of a nice picture. 10

Training the generator via an adversary, the discriminator network. Discriminator Network vs. Generator Network. 11

Training the generator via an adversary, the discriminator network. Discriminator Network vs. Generator Network. 12

The original GAN diagram: noise z → Generator Network → x = g(z); real data x and generated x → Discriminator Network → prediction y. 13

Classical Supervised Learning: the passive learner. Training data (x+, x−) are fed to an SVM, RandForest, CNN, MLP, etc. 14

Classical Supervised Learning: the passive learner. It eats what you give to it. No data = dead learner. Bad data = bad learner. It has totally no control over the data. Retrain when the data change a little bit. Is this an intelligent machine? 15

Humans: the active learner. Select what to pay attention to and what not to pay attention to. Creative, and perform thought experiments. Generalise well; just a little bit of data is OK. We perform experiments to generate data. Learn by analogy (infer the general case from one instance). 16

Passive Learner vs. Humans, the Active Learner: where is GAN? Selects what to pay attention to and what not to pay attention to. Creative and performs thought experiments. Generalises well; just a little bit of data is OK. Performs experiments to generate data. 17

Generative models and their problems 18

All generative models boil down to sampling data from a distribution: $(x_1, x_2) \sim a_1 \mathcal{N}(\mu_1, \Sigma_1) + a_2 \mathcal{N}(\mu_2, \Sigma_2)$. 19
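
As an aside, sampling from a two-component mixture like the one above is straightforward: pick a component according to the weights, then draw from that Gaussian. A minimal NumPy sketch, with made-up weights, means and covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([0.3, 0.7])                                   # mixture weights a1, a2
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]          # means mu1, mu2
Sigma = [np.eye(2), 0.5 * np.eye(2)]                       # covariances Sigma1, Sigma2

def sample_mixture(n):
    comp = rng.choice(2, size=n, p=a)                      # choose a component per draw
    return np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in comp])

samples = sample_mixture(1000)                             # 1000 draws of (x1, x2)
print(samples.mean(axis=0))
```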

Sampling black / white pixels:
$(x_1, x_2, \dots, x_N) \sim \frac{\exp(-E(x_1, \dots, x_N)/T)}{Z}$, with $x_i = \pm 1$,
$Z = \sum_{x_1, \dots, x_N} \exp(-E(x_1, \dots, x_N)/T)$ (a sum over all $2^N$ configurations),
$E(x_1, \dots, x_N) = -\sum (\text{number of neighbouring } x_i \text{ with the same sign})$.
[Figure: a 5 x 6 grid of pixels $x_1, \dots, x_{30}$.] 20

Sampling black / white pixels:
$(x_1, x_2, \dots, x_N) \sim \frac{\exp(-E(x_1, \dots, x_N)/T)}{Z}$, with $x_i = \pm 1$,
$E(x_1, \dots, x_N) = -\sum (\text{number of neighbouring } x_i \text{ with the same sign})$,
$Z = \sum_{x_1, \dots, x_N} \exp(-E(x_1, \dots, x_N)/T)$ (a sum over all $2^N$ configurations). 21
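
One standard way to draw samples from this kind of pixel model is Markov chain Monte Carlo. Below is a minimal Metropolis sketch, assuming the $\exp(-E/T)$ form above with $E$ equal to minus the number of same-sign neighbour pairs; the grid size, temperature and step count are illustrative choices, not values from the slides.

```python
import numpy as np

def energy_change(x, i, j):
    """Change in E = -(# same-sign neighbour pairs) if pixel (i, j) is flipped."""
    h, w = x.shape
    neighbours = []
    if i > 0: neighbours.append(x[i - 1, j])
    if i < h - 1: neighbours.append(x[i + 1, j])
    if j > 0: neighbours.append(x[i, j - 1])
    if j < w - 1: neighbours.append(x[i, j + 1])
    same_before = sum(n == x[i, j] for n in neighbours)
    same_after = sum(n == -x[i, j] for n in neighbours)
    return same_before - same_after          # E decreases when more neighbours agree

def metropolis_sample(h=5, w=6, T=1.0, steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(h, w))     # random black/white start
    for _ in range(steps):
        i, j = rng.integers(h), rng.integers(w)
        dE = energy_change(x, i, j)
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            x[i, j] *= -1                    # accept the flip
    return x

print(metropolis_sample())
```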

How to sample is the key problem in generative models. This one is easy: $(x_1, x_2) \sim a_1 \mathcal{N}(\mu_1, \Sigma_1) + a_2 \mathcal{N}(\mu_2, \Sigma_2)$. This one is harder but can be done with advanced methods: $(x_1, x_2, \dots, x_N) \sim \exp(-E(x_1, \dots, x_N)/T)\,/\,Z$. In general, if we can build a distribution (a theoretical or an empirical distribution will do), we can find a way to sample. 22

The difficulty is building a distribution. We don't have the slightest idea what the distribution looks like. How can a distribution be built in a high-dimensional space where the data are very sparse? E.g. images of one megapixel with greyscale values in [0, 255] are embedded in a volume of 256^{1,000,000} possible configurations. 23

Image data are certainly always very, very sparse; it is believed that they lie on some very complex manifold. Many real-world data, too, occupy an infinitesimal volume of the feature space and form very complex manifolds. https://www.researchgate.net/figure/286513346_fig1_figure-10-the-swiss-roll-dataset-a-the-euclidean-distance-bluedashed-line-of-two 24

Given a data set $x_i$, a distribution can be constructed by finding a function $f(x)$ that is large near the data points and small elsewhere, by minimising a cost function over $f(x)$: $C = -\int_{x' \text{ near } x_i} f(x')\,dx' + \int_{x' \text{ elsewhere}} f(x')\,dx'$. 25

$C = -\int_{x' \text{ near } x_i} f(x')\,dx' + \int_{x' \text{ elsewhere}} f(x')\,dx'$ [Figure: a function $f(x_1, x_2)$ that is peaked near the data points.] 26

Given a data set $x_i$, a distribution can be constructed by finding a function $f(x)$ that is large near the data points and small elsewhere, by minimising a cost function over $f(x)$: $C = -\int_{x' \text{ near } x_i} f(x')\,dx' + \int_{x' \text{ elsewhere}} f(x')\,dx'$. Two difficulties here: defining `near' (a regularisation issue), and high-dimensional integrals (1,000,000 dimensions for image data). 27

Neural networks offer a pseudo-solution 28

Let's start by reviewing classical supervised learning on two classes: x+ → Classifier → high value, e.g. 1.0. 29

Let's start by reviewing classical supervised learning on two classes: x− → Classifier → low value, e.g. 0.0. 30

$x_i$ → Classifier$_w$ → $f_w(x_i)$. Minimize a cost function with respect to $w$:
$C(w) = -\frac{1}{N_+}\sum_{i}^{N_+} f_w(x_i) + \frac{1}{N_-}\sum_{j}^{N_-} f_w(x_j)$,
where the $x_i$ are positive examples and the $x_j$ are negative examples. An example:
$C(w) = -\left( \frac{1}{N_+}\sum_i \log(D_w(x_i)) + \frac{1}{N_-}\sum_j \log(1 - D_w(x_j)) \right)$, with $D_w(x_i) \in (0, 1)$. 31
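
A small NumPy sketch of evaluating the example cost above, given the classifier outputs $D_w$ on a few positive and negative examples; the output values used here are made-up numbers for illustration.

```python
import numpy as np

def example_cost(d_pos, d_neg):
    """C(w) = -( mean(log D_w(x_i)) + mean(log(1 - D_w(x_j))) )."""
    return -(np.mean(np.log(d_pos)) + np.mean(np.log(1.0 - d_neg)))

d_pos = np.array([0.9, 0.8, 0.95])   # D_w on positive examples, ideally close to 1
d_neg = np.array([0.1, 0.2, 0.05])   # D_w on negative examples, ideally close to 0
print(example_cost(d_pos, d_neg))    # small cost: the classifier separates the classes well
```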

[Figure: the classifier output surface $f(x_1, x_2)$ over the $(x_1, x_2)$ plane.] 32

[Figure: the classifier output surface $f(x_1, x_2)$ over the $(x_1, x_2)$ plane, shown from two views.] 33

Idea of GAN (Generative Adversarial Networks). Given a set of examples of real data, generate fake data that looks like real data. The generator is a neural network parametrised by $\theta$. Sample a set of random noise $z_i$, $i = 1, \dots, N$, and generate a set of fake data from the sampled noise: $g_\theta(z_i)$, $i = 1, \dots, N$. 34

Generator Network: $z_1, z_2, \dots, z_N \to g_\theta(z_1), g_\theta(z_2), \dots, g_\theta(z_N)$. If we adjust $\theta$ well, $g_\theta(z_i)$ will look like this [example images]. 35

Training the Generator network using the Discriminator network. The discriminator network needs positive and negative training examples: $x_i$ → Discriminator$_w$ → $f_w(x_i)$,
$C(w) = -\frac{1}{N_+}\sum_{i}^{N_+} f_w(x_i) + \frac{1}{N_-}\sum_{j}^{N_-} f_w(x_j)$
(put real data examples in the first term and fake generated data in the second). 36

First train the Discriminator network:
$C(w, \theta) = -\frac{1}{N_+}\sum_{i}^{N_+} f_w(x_i) + \frac{1}{N_-}\sum_{j}^{N_-} f_w(g_\theta(z_j))$
(real data examples in the first term, fake generated data in the second),
$w^* = \arg\min_w C(w, \theta)$, keeping $\theta$ constant.
The discriminator network is trained to grade generated data as negative; it criticises any imperfections of the generator. 37

Discriminator : x -> 10 -> 10 -> 10 -> out 38

Next train the Generator network:
$C(w, \theta) = -\frac{1}{N_+}\sum_{i}^{N_+} f_w(x_i) + \frac{1}{N_-}\sum_{j}^{N_-} f_w(g_\theta(z_j))$
(real data examples in the first term, fake generated data in the second),
$w^* = \arg\min_w C(w, \theta)$ keeping $\theta$ constant; $\theta^* = \arg\max_\theta C(w, \theta)$ keeping $w$ constant. 39

Generator : x -> 5 -> 5 -> 5 -> out 40

Cost function used in the original GAN paper:
$C(w, \theta) = -\left( \frac{1}{N_+}\sum_i \log(D_w(x_i)) + \frac{1}{N_-}\sum_j \log(1 - D_w(g_\theta(z_j))) \right)$,
$w^* = \arg\min_w C(w, \theta)$ keeping $\theta$ constant; $\theta^* = \arg\max_\theta C(w, \theta)$ keeping $w$ constant.
1. Randomly initialise $w$ and $\theta$.
2. Use $g_\theta(z_i)$ to generate fake data.
3. Keep $w$ constant and optimise $\theta$ for some n iterations.
4. Keep $\theta$ constant and optimise $w$ for some k iterations. 41
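
A minimal PyTorch sketch of this training loop, using the small architectures from slides 38 and 40 (discriminator x -> 10 -> 10 -> 10 -> out, generator x -> 5 -> 5 -> 5 -> out). The data dimension, noise dimension, activation functions, learning rates, batch size and the toy "real data" source are all illustrative assumptions, not values from the slides.

```python
import torch
import torch.nn as nn

data_dim, noise_dim = 2, 2

D = nn.Sequential(nn.Linear(data_dim, 10), nn.ReLU(),
                  nn.Linear(10, 10), nn.ReLU(),
                  nn.Linear(10, 10), nn.ReLU(),
                  nn.Linear(10, 1), nn.Sigmoid())           # D_w(x) in (0, 1)
G = nn.Sequential(nn.Linear(noise_dim, 5), nn.ReLU(),
                  nn.Linear(5, 5), nn.ReLU(),
                  nn.Linear(5, 5), nn.ReLU(),
                  nn.Linear(5, data_dim))                   # g_theta(z)

opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)

def real_batch(n=64):
    # Stand-in for real data: a 2-D Gaussian blob centred at (2, 2).
    return torch.randn(n, data_dim) * 0.5 + torch.tensor([2.0, 2.0])

def cost(x_real, x_fake, eps=1e-8):
    # C(w, theta) = -( mean log D(x) + mean log(1 - D(g(z))) )
    return -(torch.log(D(x_real) + eps).mean()
             + torch.log(1 - D(x_fake) + eps).mean())

for step in range(2000):
    # Step 3: keep w constant and optimise theta (maximise C).
    loss_G = -cost(real_batch(), G(torch.randn(64, noise_dim)))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    # Step 4: keep theta constant and optimise w (minimise C).
    loss_D = cost(real_batch(), G(torch.randn(64, noise_dim)).detach())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
```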

Discriminator : x -> 10 -> 10 -> 10 -> out Generator : x -> 5 -> 5 -> 5 -> out 42

Discriminator : x -> 10 -> 10 -> 10 -> out Generator : x -> 5 -> 5 -> 5 -> out 43

Shortcomings of GAN 44

When two adversaries of equal strength compete, both can improve. 45

Discriminator Generator 46

Discriminator fails. Discriminator: x -> 10 -> 10 -> 10 -> out; Generator: x -> 5 -> 5 -> 5 -> out. 47

Generator fails due to zero gradient. 48

Generator fails due to zero gradient (surrounded). 49

Mode collapse 50

Improved Techniques for Training GANs, T. Salimans, I. Goodfellow et al. 51

1. Feature matching for training the generator. Take the features $\vec{f}_w(x)$ from the Discriminator Network. Cost function for the generator:
$C_g(w, \theta) = \left\| \frac{1}{N}\sum_i \vec{f}_w(x_i) - \frac{1}{M}\sum_j \vec{f}_w(g_\theta(z_j)) \right\|_2^2$,
$\theta^* = \arg\min_\theta C_g(w, \theta)$. 52
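
A sketch of a feature-matching generator loss in PyTorch: the generator is trained to match the mean discriminator features of real and generated batches. The split of the discriminator into a `features` trunk and a `head`, and the layer sizes, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, data_dim=2):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(data_dim, 10), nn.ReLU(),
                                      nn.Linear(10, 10), nn.ReLU())   # ~f_w(x)
        self.head = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())     # D_w(x)

    def forward(self, x):
        return self.head(self.features(x))

def feature_matching_loss(D, x_real, x_fake):
    # C_g = || mean_i f_w(x_i) - mean_j f_w(g_theta(z_j)) ||_2^2
    f_real = D.features(x_real).mean(dim=0).detach()   # real-feature mean as the target
    f_fake = D.features(x_fake).mean(dim=0)
    return ((f_real - f_fake) ** 2).sum()
```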

Discriminator cost function used in the original GAN paper:
$C(w, \theta) = -\left( \frac{1}{N_+}\sum_i \log(D_w(x_i)) + \frac{1}{N_-}\sum_j \log(1 - D_w(g_\theta(z_j))) \right)$,
$\theta^* = \arg\max_\theta C(w, \theta)$, keeping $w$ constant. 53

2. Minibatch discrimination for training the discriminator. From the discriminator features $\vec{f}(x_i)$, a learned tensor $T$ produces $\vec{o}(x_i)$ (read the paper for details); $\vec{o}$ contains information on how this data point clusters with the other data points in the minibatch. Concatenate the features to create a longer feature $[\vec{f}(x_i), \vec{o}(x_i)]$, and put this new feature back into the subsequent layers of the discriminator. 54
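
A hedged PyTorch sketch of a minibatch-discrimination layer in the spirit of Salimans et al.: the features are projected through a learned tensor $T$, pairwise L1 distances within the minibatch are turned into the statistic $\vec{o}(x_i)$, and $[\vec{f}(x_i), \vec{o}(x_i)]$ is returned for the next discriminator layer. Sizes and initialisation are illustrative; see the paper for the exact formulation.

```python
import torch
import torch.nn as nn

class MinibatchDiscrimination(nn.Module):
    def __init__(self, in_features, out_features, kernel_dims):
        super().__init__()
        # T maps a feature vector (A) to a matrix (B x C).
        self.T = nn.Parameter(0.1 * torch.randn(in_features, out_features * kernel_dims))
        self.out_features, self.kernel_dims = out_features, kernel_dims

    def forward(self, f):                                                # f: (N, A)
        M = (f @ self.T).view(-1, self.out_features, self.kernel_dims)   # (N, B, C)
        l1 = (M.unsqueeze(0) - M.unsqueeze(1)).abs().sum(dim=3)          # (N, N, B)
        o = torch.exp(-l1).sum(dim=1)                   # (N, B), includes the self term
        return torch.cat([f, o], dim=1)                 # (N, A + B)

# layer = MinibatchDiscrimination(in_features=10, out_features=5, kernel_dims=3)
```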

3. Historical averaging. Add a term to the cost function: $\left\| \theta - \frac{1}{t}\sum_{i=1}^{t} \theta[i] \right\|^2$. Effect: at every step, $\theta$ does not change too much in magnitude and direction. 55
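
A small sketch of one way to implement such a penalty in PyTorch: keep a running average of past parameter values and penalise the squared distance to it. The incremental-mean bookkeeping is an implementation choice, not something specified on the slide.

```python
import torch

class HistoricalAverage:
    def __init__(self, params):
        self.avg = [p.detach().clone() for p in params]   # running average of theta[i]
        self.t = 1

    def penalty(self, params):
        # || theta - (1/t) sum_i theta[i] ||^2
        return sum(((p - a) ** 2).sum() for p, a in zip(params, self.avg))

    def update(self, params):
        self.t += 1
        for a, p in zip(self.avg, params):
            a += (p.detach() - a) / self.t                # incremental running mean

# Add hist.penalty(model.parameters()) to the training loss, and call
# hist.update(model.parameters()) after each optimiser step.
```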

4. One-sided label smoothing. A change in discriminator training: instead of asking the discriminator to output 1 for real data, ask it to output a softer value like 0.9 for real data. Fake data still has label 0. This encourages a softer decision boundary, e.g. the sigmoid does not need to saturate and drive the gradient to zero. 56
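
A minimal sketch of a discriminator loss with one-sided label smoothing in PyTorch, assuming a discriminator `D` with sigmoid output; only the real targets are softened to 0.9, the fake targets stay at 0.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, x_real, x_fake, real_target=0.9):
    real_labels = torch.full((x_real.size(0), 1), real_target)   # softened real label
    fake_labels = torch.zeros(x_fake.size(0), 1)                 # fake label stays at 0
    return (F.binary_cross_entropy(D(x_real), real_labels)
            + F.binary_cross_entropy(D(x_fake), fake_labels))
```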

5. Virtual batch normalisation. Instead of normalising differently for each minibatch, use a reference batch so that normalisation is consistent across minibatches. What I would do is use one set of statistics for normalisation across all the data, e.g. compute a z-score using one mean and one variance. What the paper proposes is to recalculate the mean and variance for each data point, which increases the computational time a lot. 57
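
A sketch of the simpler alternative suggested here: compute one mean and one variance from a fixed reference batch and z-score every minibatch with them, so normalisation is identical across minibatches. This is the lecturer's suggested shortcut, not the full virtual batch normalisation from the paper.

```python
import torch

class ReferenceNorm:
    def __init__(self, reference_batch, eps=1e-5):
        # One set of statistics, computed once from a fixed reference batch.
        self.mean = reference_batch.mean(dim=0, keepdim=True)
        self.std = reference_batch.std(dim=0, keepdim=True) + eps

    def __call__(self, x):
        return (x - self.mean) / self.std    # same statistics for every minibatch

# norm = ReferenceNorm(torch.randn(256, 10)); y = norm(torch.randn(64, 10))
```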

Wasserstein GAN 58

Earth Mover's Distance. [Figure: a pile of earth on positions 1 to 9 is moved to a new position; cost = 3.] 59

Earth Mover's Distance. [Figure: piles A, B, C, D, E on positions 1 to 9 are rearranged to A(2) B(2) C(3) D(3) E(4); cost = 14.] 60

Earth Mover's Distance. [Figure: piles A, B, C, D, E on positions 1 to 9 are rearranged to A(2) C(1) E(1) B(3) D(5); cost = 12.] 61

Earth Mover's Distance. [Figure: piles A to I on positions 1 to 9 are rearranged to A(6) B(3) C(5) D(3) E(2) F(3) G(0) H(0) I(0); cost = 22.] 62
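
For piles on a line with unit spacing and equal total mass, the Earth Mover's Distance can be computed as the area between the two cumulative curves. A small NumPy sketch (the example configuration is made up, not one of the slide's figures): moving one unit of earth three positions to the right costs 3.

```python
import numpy as np

def earth_movers_distance_1d(p, q):
    """EMD between two 1-D piles with equal total mass on positions 1, 2, ..., len(p)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    assert np.isclose(p.sum(), q.sum()), "both configurations must hold the same amount of earth"
    # The work needed equals the area between the two cumulative curves.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

p = np.zeros(9); p[2] = 1                # one unit of earth at position 3
q = np.zeros(9); q[5] = 1                # target: one unit of earth at position 6
print(earth_movers_distance_1d(p, q))    # 3.0
```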

Distribution of real and fake data. $P_r(x)$ is the distribution of real data; it is constant. $P_\theta(g_\theta(z))$ is the distribution of generated data; it changes when the generator changes. $W(P_r(x), P_\theta(g_\theta(z)))$ is the Earth Mover's Distance between the real and the generated data. 63

Idea of Wasserstein GAN. No discriminator (in the paper they use the word 'critic'). Given $P_r(x)$: initialize $g_\theta(z_i)$ and generate $P_\theta(g_\theta(z))$, then adjust the generator such that $W(P_r(x), P_\theta(g_\theta(z)))$ is minimised. 64

Kantorovich-Rubinstein duality (I did not prove it):
$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \left( \frac{1}{N}\sum_i f(x_i) - \frac{1}{M}\sum_j f(g(z_j)) \right)$
over 1-Lipschitz functions (loosely speaking, the gradient does not exceed one); the $x_i$ are sampled from $P_r(x)$ and the $g(z_j)$ from $P_\theta(g_\theta(z))$. 65

Kantorovich-Rubinstein duality (I did not prove it):
$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \left( \frac{1}{N}\sum_i f(x_i) - \frac{1}{M}\sum_j f(g(z_j)) \right)$;
the $x_i$ are sampled from $P_r(x)$ and the $g(z_j)$ from $P_\theta(g_\theta(z))$. This is a constrained optimisation problem! 66

Representation of a 1-Lipschitz function. One way to parameterise a function is to use a neural network: by adjusting the weights, we create different functions. When the network is small, the class of functions obtained is limited. When the network is deep, we can represent more functions. 67

$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \left( \frac{1}{N}\sum_i f(x_i) - \frac{1}{M}\sum_j f(g(z_j)) \right)$.
Represent $f(\cdot)$ by a neural network $f_w(\cdot)$ and use gradient ascent to find $W(P_r(x), P_\theta(g_\theta(z)))$. 68
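
A minimal PyTorch sketch of this gradient-ascent estimate, with an assumed small critic network, optimiser and step count. Note that the 1-Lipschitz constraint on $f_w$ is not handled in this sketch; as the previous slide points out, this is a constrained optimisation problem, and the constraint would have to be enforced separately.

```python
import torch
import torch.nn as nn

data_dim = 2
critic = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(),
                       nn.Linear(32, 32), nn.ReLU(),
                       nn.Linear(32, 1))            # f_w: a real-valued score, no sigmoid
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

def dual_objective(x_real, x_fake):
    # (1/N) sum_i f_w(x_i) - (1/M) sum_j f_w(g(z_j))
    return critic(x_real).mean() - critic(x_fake).mean()

def estimate_wasserstein(x_real, x_fake, ascent_steps=100):
    x_real, x_fake = x_real.detach(), x_fake.detach()
    for _ in range(ascent_steps):
        loss = -dual_objective(x_real, x_fake)      # ascend by minimising the negative
        opt.zero_grad(); loss.backward(); opt.step()
    return dual_objective(x_real, x_fake).item()
```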

Algorithm: initialize $w$ and $\theta$, and generate $P_\theta(g_\theta(z))$.
(1) Maximize by adjusting $w$: $W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \left( \frac{1}{N}\sum_i f_w(x_i) - \frac{1}{M}\sum_j f_w(g(z_j)) \right)$.
(2) Minimize the Earth Mover's distance: $\theta^* = \arg\min_\theta W(P_r, P_\theta)$.
(3) Repeat (1) & (2). 69

Assignment #3: There is something strange about this algorithm. What is it?

Deep Convolutional GAN. A. Radford, L. Metz, S. Chintala, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016. 72

one epoch 73

5 epochs 74

one epoch vs. 5 epochs 75

Generator 76

77

Every 10 iterations Final generated digits 78

Every 10 iterations Final generated digits 79

Every 10 iterations Final generated digits 80

Every 10 iterations Final generated digits 81

BiGAN Bidirectional GAN 82

[Diagram: noise z → Generator Network → x = g(z); x → Discriminator Network → prediction y.] 83

Given z, generate x: noise z → Generator Network → x = g(z). How about the other direction: given x, generate z? x → Encoder Network → z = E(x). 84

[Diagram: noise z → Generator Network → x = g(z); x → Discriminator Network → prediction y.] 85

[Diagram: x → Encoder Network → z = E(x); noise z → Generator Network → x = g(z); Discriminator Network → prediction y.] 86

[Diagram: x → Encoder Network → z = E(x); noise z → Generator Network → x = g(z); the pairs (x, z) from both paths go into the Discriminator Network → prediction y.] 87

Cost function:
$C(w, \theta, \phi) = -\left( \frac{1}{N_+}\sum_i \log\{D_w(x_i, E_\phi(x_i))\} + \frac{1}{N_-}\sum_j \log\{1 - D_w(g_\theta(z_j), z_j)\} \right)$,
$\max_{\theta, \phi} \min_w C(w, \theta, \phi)$.
Adjust the generator and the encoder such that all fake data look like real ones: $g_\theta(z_j) = x' \to x$ and $E_\phi(x_i) = z' \to z$. By this procedure, $g(E_\phi(x_i)) \to x_i$, i.e. $E \approx g^{-1}$. 88
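
A minimal PyTorch sketch of this BiGAN cost: the discriminator scores $(x, z)$ pairs, real pairs are $(x, E_\phi(x))$ and fake pairs are $(g_\theta(z), z)$. The network sizes and data dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

data_dim, noise_dim = 2, 2
G = nn.Sequential(nn.Linear(noise_dim, 16), nn.ReLU(), nn.Linear(16, data_dim))   # g_theta
E = nn.Sequential(nn.Linear(data_dim, 16), nn.ReLU(), nn.Linear(16, noise_dim))   # E_phi
D = nn.Sequential(nn.Linear(data_dim + noise_dim, 16), nn.ReLU(),
                  nn.Linear(16, 1), nn.Sigmoid())                                 # D_w(x, z)

def bigan_cost(x_real, z_noise, eps=1e-8):
    # C(w, theta, phi) = -( mean log D(x, E(x)) + mean log(1 - D(g(z), z)) )
    real_pair = torch.cat([x_real, E(x_real)], dim=1)
    fake_pair = torch.cat([G(z_noise), z_noise], dim=1)
    return -(torch.log(D(real_pair) + eps).mean()
             + torch.log(1 - D(fake_pair) + eps).mean())

# The discriminator minimises this cost over w; the generator and the encoder
# maximise it over theta and phi, as in max_{theta,phi} min_w C.
```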

[Figure: input images x and their reconstructions g(E(x)).] 89

[Figure: input images x and their reconstructions g(E(x)).] 90

Assignment #3: we will announce it this weekend. 91

92