Depth Qualification Exam Presentation

1 Depth Qualification Exam Presentation
Li Wan, Dept. of Computer Science, Courant Institute, New York University
Depth Qualification Exam Presentation p. 1/29

2 Overview of Talk
1. Literature Survey
(a) Approaches to training deep neural networks
(b) Topic models and their application to computer vision
2. Research Result: how to effectively combine a neural network model with a topic model

3 Simple Neural Network
Non-linear activation function: $h_i = f(w_i^T x + b)$, where $f$ can be the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$, with outputs in $(0,1)$, or the tanh function, with outputs in $(-1,1)$ [1].
[1] Bishop, Neural Networks for Pattern Recognition, 1995
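As a minimal sketch of the hidden-unit computation on this slide, the snippet below evaluates $h_i = f(w_i^T x + b)$ for both activation choices. The helper name `hidden_layer`, the per-unit bias list, and the toy weights are illustrative assumptions, not part of the presentation.

```python
import math

def sigmoid(a):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def hidden_layer(W, x, b, f=sigmoid):
    """Compute h_i = f(w_i^T x + b_i) for each row w_i of W.

    Assumption: one bias per hidden unit (the slide writes a single b).
    """
    return [f(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + b_i)
            for w_i, b_i in zip(W, b)]

# Toy example (hypothetical values): 2 hidden units, 3 inputs.
W = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]
b = [0.0, 0.1]
x = [1.0, 2.0, -1.0]

h_sigmoid = hidden_layer(W, x, b, f=sigmoid)   # each value lies in (0, 1)
h_tanh = hidden_layer(W, x, b, f=math.tanh)    # each value lies in (-1, 1)
```

Swapping `f` is the only change needed to move between the two output ranges.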

4 Deep Neural Networks
Deep networks are better than shallow ones.
However, standard random initialization leads to poor training and generalization error [1][3] in deep neural networks (except deep ConvNets [2]).
[1] Bengio et al., Greedy layer-wise training of deep networks, NIPS, 2007
[2] LeCun et al., Back-propagation Applied to Handwritten Zip Code Recognition, Neural Comput.
[3] Hinton and Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 2006
[4] Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, AISTATS, 2010

5 Pre-training Deep Neural Networks
How to pre-train deep neural networks: greedy layer-wise pre-training [1] ($x$: input data, $h$: hidden layer, $w$: parameters)
Generative model [2][3] (restricted Boltzmann machines): $w = \arg\max_w p(x,h;w)$
Encoder ($f$)-decoder ($g$) model [4][5][6][7]: $w = \arg\min_{w,h} \|h - f(x;w)\| + \|x - g(h;w)\| + \lambda\|w\|_1$
[1] Hinton et al., A fast learning algorithm for deep belief nets, Neural Computation, 2006
[2] Hinton et al., Training products of experts by minimizing contrastive divergence, Neural Comput.
[3] Hinton and Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 2006
[4] Ranzato et al., Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, CVPR, 2007
[5] Gregor and LeCun, Learning Fast Approximations of Sparse Coding, ICML, 2010
[6] Ranzato et al., Efficient learning of sparse representations with an energy-based model, NIPS
[7] Ranzato et al., Sparse Feature Learning for Deep Belief Networks, NIPS

6 Restricted Boltzmann Machines
Undirected graphical model (bipartite graph with $U = \{x\}$ and $V = \{h\}$) with energy function (binary case): $E(x,h) = -b^T x - c^T h - h^T W x$
Fast inference: $p(h|x;w) = \prod_j p(h_j|x;w)$, with $p(h_j = 1|x;w) = \sigma(w_j x + c_j)$, where $w_j$ is the $j$-th row of $W$
Fast sampling: $p(x|h;w) = \prod_i p(x_i|h;w)$, with $p(x_i = 1|h;w) = \sigma(w_i^T h + b_i)$, where $w_i$ is the $i$-th column of $W$
Neural network feed-forward operation: $f(x;w) = p(h|x;w)$
Initialize $W$ in the neural network by maximizing $p(x;w) = \sum_h p(x,h;w)$, using the update: $\Delta W = E_{\text{data}}[hx^T] - E_{\text{model}}[hx^T]$
However, $E_{\text{model}}[hx^T]$ is intractable [1] because the number of possible $h$ is exponential in its size.
Contrastive Divergence [2][3] and its extensions [4] were proposed to approximate the model expectation with a few samples.
[1] Long et al., Restricted Boltzmann Machines are Hard to Approximately Evaluate or Simulate
[2] Hinton et al., Training products of experts by minimizing contrastive divergence, Neural Comput.
[3] Bengio and Delalleau, Justifying and Generalizing Contrastive Divergence, Neural Comput.
[4] Nair and Hinton, Rectified Linear Units Improve Restricted Boltzmann Machines, ICML, 2010
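The following is a rough sketch of one-step Contrastive Divergence (CD-1) for a binary RBM, where the model expectation is replaced by a single Gibbs step started at the data. Bias naming follows the energy function (`b` for visible units, `c` for hidden units). The layer sizes, learning rate, toy data, and function names are assumptions made for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def p_h_given_x(W, c, x):
    """p(h_j = 1 | x) = sigmoid(w_j x + c_j), with w_j the j-th row of W."""
    return [sigmoid(sum(wji * xi for wji, xi in zip(row, x)) + cj)
            for row, cj in zip(W, c)]

def p_x_given_h(W, b, h):
    """p(x_i = 1 | h) = sigmoid(w_i^T h + b_i), with w_i the i-th column of W."""
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]

def sample(probs, rng):
    """Draw independent Bernoulli samples with the given probabilities."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]

def cd1_update(W, b, c, x, lr, rng):
    """One CD-1 step: E_model[h x^T] is approximated by a single Gibbs
    step started at the data vector x."""
    ph0 = p_h_given_x(W, c, x)                # positive phase
    h0 = sample(ph0, rng)
    x1 = sample(p_x_given_h(W, b, h0), rng)   # one-step reconstruction
    ph1 = p_h_given_x(W, c, x1)               # negative phase
    for j in range(len(c)):
        for i in range(len(b)):
            W[j][i] += lr * (ph0[j] * x[i] - ph1[j] * x1[i])
        c[j] += lr * (ph0[j] - ph1[j])
    for i in range(len(b)):
        b[i] += lr * (x[i] - x1[i])

rng = random.Random(0)
n_visible, n_hidden = 4, 3   # hypothetical sizes
W = [[rng.gauss(0.0, 0.01) for _ in range(n_visible)] for _ in range(n_hidden)]
b = [0.0] * n_visible        # visible biases
c = [0.0] * n_hidden         # hidden biases
data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
for _ in range(200):
    for x in data:
        cd1_update(W, b, c, x, lr=0.1, rng=rng)
hidden_probs = p_h_given_x(W, c, data[0])   # feed-forward response f(x; w)
```

After training, `p_h_given_x` is the same computation a sigmoid layer performs, which is why the learned `W` can initialize a feed-forward network.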

7 Encoder-Decoder Model
The encoding operation should preserve the essential information of the data $x$. Verify this by reconstructing $x$ with the decoder ($g$) from $h = f(x;w)$.
Minimize the encoding error $\|h - f(x;w)\|$ and the decoding error $\|x - g(h;w)\|$, with a penalty $\lambda\|w\|_1$ to encourage local filters.
$t(h)$ is a penalty term on the code $h$ that encourages a special property such as sparseness [3].
$L(w,h) = \|h - f(x;w)\| + \|x - g(h;w)\| + \lambda\|w\|_1 + \alpha\, t(h)$
The neural network feed-forward operation is an encoding operation.
Learn $W$ by repeating the following steps [1][2][3] (with randomly initialized $w$):
$h_0 \leftarrow f(x;w)$
$h_t \leftarrow h_{t-1} - \eta\, \partial L(w,h_{t-1})/\partial h_{t-1}$, for a few steps starting from the initial condition $h_0$
$w \leftarrow w - \eta\, \partial L(w,h_t)/\partial w$
[1] Ranzato et al., Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, CVPR, 2007
[2] Ranzato et al., Efficient learning of sparse representations with an energy-based model, NIPS
[3] Ranzato et al., Sparse Feature Learning for Deep Belief Networks, NIPS

8 Applications of Deep Neural Networks
Natural image patch modeling [1][8]
Image classification [2][5]
Text modeling [3]
Human pose tracking [4]
Digit recognition [6][7]
[1] Ranzato and Hinton, Modeling Pixel Means and Covariances Using Factorized Third-Order Boltzmann Machines, CVPR, 2010
[2] Lee et al., Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML, 2009
[3] Salakhutdinov and Hinton, Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes, NIPS, 2007
[4] Taylor et al., Dynamical Binary Latent Variable Models for 3D Human Pose Tracking, CVPR, 2010
[5] Ranzato et al., Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition, CVPR, 2007
[6] Salakhutdinov and Hinton, Deep Boltzmann Machines, AISTATS, 2009
[7] Salakhutdinov and Larochelle, Efficient Learning of Deep Boltzmann Machines, AISTATS, 2010
[8] Osindero and Hinton, Modeling image patches with a directed hierarchy of Markov random fields, NIPS

9 Neural Network + Gaussian Regression
Given a data set $x$ with labels $y$, we are interested in the following probabilistic regression model: $y = f(x) + \epsilon$ with $f(x) \sim N(0,K)$ and $\epsilon \sim N(0,\sigma^2)$
Here $K_{ij} = \alpha \exp(-\beta (x_i - x_j)^T (x_i - x_j))$ is the covariance function.
The loss function $\log p(y|x)$ can be defined by integrating out $f(x)$:
$L = \log p(y|x) = -\frac{1}{2}\log|K + \sigma^2 I| - \frac{1}{2} y^T (K + \sigma^2 I)^{-1} y + C$
1. The gradient $\partial L/\partial x$ can be written down from the definition.
2. If $x$ is the response of a neural network with input $v$, $\partial x/\partial v$ can be defined.
3. Back-propagation for the joint model is defined based on $\partial L/\partial x$ and $\partial x/\partial v$ [1].
[1] Salakhutdinov and Hinton, Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes, NIPS, 2007
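To make the loss on this slide concrete, here is a small dependency-free sketch that evaluates $\log p(y|x)$ with the squared-exponential covariance, using a Cholesky factorization for the log-determinant and the quadratic form. The constant is taken to be $C = -\frac{n}{2}\log 2\pi$; the function names and the toy inputs are assumptions for illustration only.

```python
import math

def rbf_kernel(X, alpha, beta):
    """K_ij = alpha * exp(-beta * (x_i - x_j)^T (x_i - x_j))."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [[alpha * math.exp(-beta * sqdist(xi, xj)) for xj in X] for xi in X]

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal_likelihood(X, y, alpha, beta, sigma2):
    """log p(y|x) = -1/2 log|K + s2 I| - 1/2 y^T (K + s2 I)^{-1} y + C,
    with C = -(n/2) log(2 pi)."""
    n = len(y)
    K = rbf_kernel(X, alpha, beta)
    A = [[K[i][j] + (sigma2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(A)
    # Forward substitution: solve L z = y, so that y^T A^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(zi * zi for zi in z)
    return -0.5 * log_det - 0.5 * quad - 0.5 * n * math.log(2.0 * math.pi)

# Toy usage (hypothetical 1-D features and labels).
X = [[0.0], [1.0], [2.0]]
y = [0.1, 0.9, 0.2]
ll = log_marginal_likelihood(X, y, alpha=1.0, beta=0.5, sigma2=0.1)
```

In the joint model, `X` would be the network response, so the gradient of this quantity with respect to `X` is what gets back-propagated.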

10 Topic Models: LSA
Latent Semantic Analysis [1]: map documents to a latent semantic space of reduced dimensionality.
Given a co-occurrence table $X$ where each row is a histogram of words:
Apply SVD to $X$: $X = U\Sigma V^T$.
Approximate $X$ by keeping a few top singular values in $\Sigma$: $\tilde{X} = U\tilde{\Sigma}V^T \approx U\Sigma V^T = X$
The co-occurrence table in latent space is $U\tilde{\Sigma}$, because the inner product space is $\tilde{X}\tilde{X}^T = U\tilde{\Sigma}^2 U^T$.
[1] Deerwester et al., Indexing by latent semantic analysis
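The truncation step can be illustrated with a rank-1 example. To stay dependency-free, this sketch swaps in power iteration on $X^T X$ instead of a full SVD, so it recovers only the top singular triplet. The count table and function name are hypothetical.

```python
import math

def top_singular_triplet(X, iters=200):
    """Power iteration on X^T X: returns (u, sigma, v) for the largest
    singular value of X."""
    n_docs, n_words = len(X), len(X[0])
    v = [1.0 / math.sqrt(n_words)] * n_words
    for _ in range(iters):
        Xv = [sum(X[d][w] * v[w] for w in range(n_words)) for d in range(n_docs)]
        XtXv = [sum(X[d][w] * Xv[d] for d in range(n_docs)) for w in range(n_words)]
        norm = math.sqrt(sum(t * t for t in XtXv))
        v = [t / norm for t in XtXv]
    Xv = [sum(X[d][w] * v[w] for w in range(n_words)) for d in range(n_docs)]
    sigma = math.sqrt(sum(t * t for t in Xv))
    u = [t / sigma for t in Xv]
    return u, sigma, v

# Hypothetical document-by-word count table (rows are documents).
X = [[2.0, 1.0, 0.0],
     [4.0, 2.0, 0.0],
     [0.0, 0.0, 3.0]]
u, sigma, v = top_singular_triplet(X)

# Latent coordinates U * Sigma_tilde: one value per document at rank 1.
latent_coords = [ui * sigma for ui in u]
# Rank-1 approximation X_tilde = u * sigma * v^T.
X_rank1 = [[ui * sigma * vw for vw in v] for ui in u]
```

The first two documents share a word pattern and get nonzero latent coordinates, while the third document is orthogonal to the leading direction and maps to zero.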

11 Topic Models: pLSA
Probabilistic Latent Semantic Analysis [1]
Joint distribution: $p(d,w) = p(d)\sum_z p(w|z)\,p(z|d) = \sum_z p(z)\,p(d|z)\,p(w|z)$
Relationship with LSA: $U_{ik} = p(d_i|z_k)$, $V_{jk} = p(w_j|z_k)$ and $\Sigma_{kk} = p(z_k)$
Learn with EM by alternately updating $p(z|w,d)$ and $p(w|z)$, $p(d|z)$, $p(z)$.
[1] Hofmann, Unsupervised learning by probabilistic latent semantic analysis, UAI
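The EM alternation can be sketched as follows, using the symmetric parameterization $p(d,w) = \sum_z p(z)\,p(d|z)\,p(w|z)$. The function name, random initialization, iteration count, and toy count table are assumptions for illustration.

```python
import math
import random

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """EM for pLSA; counts[d][w] is the number of times word w occurs
    in document d. Returns p(z), p(d|z), p(w|z) and the log-likelihood
    measured at the start of each iteration."""
    rng = random.Random(seed)
    D, W = len(counts), len(counts[0])

    def normalized(v):
        s = sum(v)
        return [x / s for x in v]

    pz = [1.0 / n_topics] * n_topics
    pd_z = [normalized([rng.random() + 0.1 for _ in range(D)])
            for _ in range(n_topics)]
    pw_z = [normalized([rng.random() + 0.1 for _ in range(W)])
            for _ in range(n_topics)]
    log_liks = []
    for _ in range(n_iters):
        new_z = [0.0] * n_topics
        new_d = [[0.0] * D for _ in range(n_topics)]
        new_w = [[0.0] * W for _ in range(n_topics)]
        ll = 0.0
        for d in range(D):
            for w in range(W):
                if counts[d][w] == 0:
                    continue
                # E-step: p(z | d, w) is proportional to p(z) p(d|z) p(w|z).
                joint = [pz[z] * pd_z[z][d] * pw_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                ll += counts[d][w] * math.log(total)
                for z in range(n_topics):
                    r = counts[d][w] * joint[z] / total
                    new_z[z] += r
                    new_d[z][d] += r
                    new_w[z][w] += r
        # M-step: renormalize the expected counts.
        pz = normalized(new_z)
        pd_z = [normalized(row) for row in new_d]
        pw_z = [normalized(row) for row in new_w]
        log_liks.append(ll)
    return pz, pd_z, pw_z, log_liks

# Toy usage: 4 documents, 4 words, 2 topics (hypothetical counts).
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 1, 2, 5]]
pz, pd_z, pw_z, log_liks = plsa_em(counts, n_topics=2)
```

Each EM iteration should not decrease the data log-likelihood, which gives a simple sanity check on an implementation.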

12 Topic Models: LDA
Each document is a random mixture of corpus-wide topics.
Each word is drawn from one of those topics.

13 Topic Models: LDA
Latent Dirichlet Allocation [1]
Fully generative model: an extension of pLSA
Marginal distribution of a document: $p(w|\alpha,\beta) = \int p(\theta|\alpha)\left(\prod_{n=1}^{N}\sum_{z_n} p(z_n|\theta)\,p(w_n|z_n,\beta)\right)d\theta$
Learn with the variational EM algorithm
[1] Blei et al., Latent Dirichlet Allocation, JMLR, 2003
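The generative story behind this distribution can be sketched by sampling one document: draw $\theta$ from the Dirichlet prior, then a topic and a word for each position. This covers only the sampling direction, not variational EM. The function names and toy parameters are assumptions for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Sample theta ~ Dirichlet(alpha) by normalizing Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def generate_document(alpha, beta, n_words, rng):
    """LDA generative process for one document.

    beta[k][w] = p(w | z = k) is the word distribution of topic k.
    """
    theta = sample_dirichlet(alpha, rng)   # per-document topic mixture
    vocab = range(len(beta[0]))
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]   # z_n ~ p(z | theta)
        w = rng.choices(vocab, weights=beta[z])[0]             # w_n ~ p(w | z_n, beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Toy usage: 2 topics over a 4-word vocabulary (hypothetical values).
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1, 0.0],
        [0.0, 0.1, 0.2, 0.7]]
theta, topics, words = generate_document(alpha, beta, n_words=20, rng=rng)
```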

14 Object Recognition in Computer Vision

15 Extract Image Features
Extract features from image patches (SIFT [1], HOG [2], etc.)
Learn a dictionary from visual features (k-means, sparse coding [3], etc.)
Represent images by combining features (histogram, global/local pooling [3][4])
[1] Lowe, Distinctive image features from scale-invariant keypoints, IJCV, 2004
[2] Dalal and Triggs, Histograms of oriented gradients for human detection, CVPR, 2005
[3] Yang et al., Linear spatial pyramid matching using sparse coding for image classification, CVPR, 2009
[4] Boureau et al., Learning mid-level features for recognition, CVPR, 2010

16 Model Image Features
Discriminative model: SVM with linear / histogram-intersection / $\chi^2$ kernel [1]
Generative model: a hierarchical Bayesian model can be applied, such as an extension of the naïve Bayes model [2], the pLSA model [3][4], or the LDA model [5].
[1] Lazebnik et al., Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR, 2006
[2] Dance et al., Visual categorization with bags of keypoints, ECCV workshop, 2004
[3] Sivic et al., Discovering objects and their location in images, ICCV, 2005
[4] Bosch et al., Scene Classification via pLSA, ECCV, 2006
[5] Fei-Fei and Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, CVPR, 2005

17 Bayesian Model for Object Recognition
An extension of the Bayesian topic model that includes location information [1]-[4]
symbol      description                         notes
$w_{ji}$    K-means index (patch appearance)    $w_{ji} \sim \text{Multi}(\eta_z)$
$v_{ji}$    object part (patch location)        $v_{ji} \sim N(\mu_k, \Lambda_k)$
$z_{ji}$    topic index                         $z_{ji} \sim \text{Multi}(\pi_o)$
$\rho_j$    object center location              $\rho_j \sim N(\gamma, \varsigma)$
$o_j$       image label
[1] Sudderth et al., Learning Hierarchical Models of Scenes, Objects, and Parts, ICCV, 2005
[2] Sudderth et al., Describing Visual Scenes using Transformed Dirichlet Processes, NIPS, 2005
[3] Kivinen et al., Learning Multiscale Representations of Natural Scenes Using Dirichlet Processes, ICCV, 2007
[4] Sudderth et al., Describing Visual Scenes using Transformed Objects and Parts, IJCV, 2008

18 Bayesian Model for Object Recognition
Each object is a mixture of topics.
Each appearance and location pair is drawn from one of those topics.

22 My Research Result
Combine a neural network model with a topic model
Neural network: nonlinear transformation
Bayesian topic model: transparent to humans
Replace the regression component of the neural network with a Bayesian model (topic model)
The Bayesian model takes its input from the response of the neural network

24 What We Want to Learn
Given the input data $v$ with label $y$, $x = f_w(v)$ is the output of the neural network given input $v$.
The likelihood function is given by: $p_v(v|y) = p_x(f_w(v)|y)\,\left|\det (f_w)'(v)\right|$
$p_x(f_w(v)|y)$ is defined by the generative model
$(f_w)'(v)$ is the Jacobian matrix
Applying Bayes' rule, we have the loss function:
$p(y|v) = \frac{p_v(v|y)}{\sum_{\tilde{y}} p_v(v|\tilde{y})} = \frac{p_x(f_w(v)|y)}{\sum_{\tilde{y}} p_x(f_w(v)|\tilde{y})} = p_x(y|f_w(v))$
(The Jacobian term does not depend on the label, so it cancels between numerator and denominator; equal class priors are implied.)

27 Model Overview
[Figure, panels (a), (b), (c). Neural network: input $v$ (128d); Layer 1, sigmoid layer ($W_1$), hidden units $h$ (600 units); Layer 2, linear layer ($W_2$), output feature $x$ (25 units). Graphical model on top: Layer 3, Gaussian likelihood $F_2(\phi = \{\mu,\sigma\})$, $K$ (= 200); Layer 4, integration $F_1(\eta)$ over latent word $u$, $M$ (= 45); Layer 5, integration $F_1(\pi)$ over latent topic $z$, $S$ (= 15); Bayes layer $F_0$, class labels $y$. Plate diagram variables: $y$, $\alpha$, $\beta$, $\gamma$, $\pi$, $\eta$, $\phi$, $z$, $u$, $x$; plates $n_i$, $N$.]
1. First initialize the parameters $\{w_0, \pi_0, \eta_0, \phi_0\}$ by pre-training the neural network and the graphical model.
2. Then update them jointly by gradient descent: convert the generative model into extra layers of the neural network (assuming closed-form inference in the top graphical model).

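28 Generative Model
[Figure: plate diagram with variables $y$, $\alpha$, $\pi$, $z$, $\beta$, $\eta$, $u$, $\gamma$, $\phi$, $x$; sizes $S$, $M$, $K$; plates $n_i$, $N$.]
1. Draw latent topic $z_j \sim \text{Multi}(\pi_{y_i})$
2. Draw latent word $u_j \sim \text{Multi}(\eta_{z_j})$
3. Draw feature vector $x_j \sim \text{Gaussian}(\phi_{u_j})$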
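The three draws can be sketched as ancestral sampling. This is an illustration under stated assumptions: the Gaussian is taken to be diagonal with $\phi_u = (\mu_u, \sigma_u)$, and the class, topic, and word counts are tiny hypothetical values rather than the sizes used in the model.

```python
import random

def sample_feature(y, pi, eta, phi, rng):
    """Sample one feature vector x given class label y:
    z ~ Multi(pi[y]), u ~ Multi(eta[z]), x ~ Gaussian(phi[u])."""
    z = rng.choices(range(len(pi[y])), weights=pi[y])[0]    # latent topic
    u = rng.choices(range(len(eta[z])), weights=eta[z])[0]  # latent word
    mu, sigma = phi[u]
    # Assumption: independent dimensions (diagonal covariance).
    x = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return z, u, x

# Hypothetical toy parameters: 2 classes, 2 topics, 3 words, 2-D features.
pi = [[0.9, 0.1], [0.2, 0.8]]              # pi[y][z] = p(z | y)
eta = [[0.6, 0.4, 0.0], [0.0, 0.3, 0.7]]   # eta[z][u] = p(u | z)
phi = [([0.0, 0.0], [1.0, 1.0]),           # phi[u] = (mean, std)
       ([3.0, 3.0], [0.5, 0.5]),
       ([-3.0, 3.0], [1.0, 1.0])]

rng = random.Random(0)
samples = [sample_feature(1, pi, eta, phi, rng) for _ in range(5)]
```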

31 Joint Optimization
Overall loss function: $L = -\sum_j \log p(f_w(v_j)|y,\pi,\eta,\phi) + \sum_j \log \sum_{i=1}^{S} p(f_w(v_j)|y=i,\pi,\eta,\phi)$
Generative model likelihood function: $p(f_w(v_j)|y,\pi,\eta,\phi) = \sum_{z_j=1}^{M}\sum_{u_j=1}^{K} p(f_w(v_j)|u_j,\phi)\,p(u_j|z_j,\eta)\,p(z_j|y,\pi)$
Trick: decompose the likelihood function into small pieces

35 Unified Model
Gaussian likelihood layer ($F_2$: $f_w(v_j) \to p(f_w(v_j)|u_j,\phi)$)
Integration layer on $u$ ($F_1(\cdot,\eta)$)
Integration layer on $z$ ($F_1(\cdot,\pi)$)
Bayesian layer ($F_0$: $p(f_w(v_j)|y) \to p(y|f_w(v_j))$)
$p(f_w(v_j)|y,\pi,\eta,\phi) = \underbrace{\sum_{z_j=1}^{M}\underbrace{\sum_{u_j=1}^{K}\underbrace{p(f_w(v_j)|u_j,\phi)}_{F_2}\,p(u_j|z_j,\eta)}_{F_1(\cdot,\eta)}\,p(z_j|y,\pi)}_{F_1(\cdot,\pi)}$
$L = -\sum_j \log p(f_w(v_j)|y,\pi,\eta,\phi) + \sum_j \log \sum_{i=1}^{S} p(f_w(v_j)|y=i,\pi,\eta,\phi)$
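The layer decomposition can be sketched as a forward pass that maps a feature vector to class posteriors. This is a minimal illustration, assuming diagonal Gaussians, equal class priors, and small hypothetical parameter tables. It works with raw probabilities for readability, whereas a practical version would work in the log domain.

```python
import math

def gaussian_layer(x, phi):
    """F2: p(x | u, phi) for every latent word u (diagonal Gaussians)."""
    out = []
    for mu, sigma in phi:
        log_p = sum(-0.5 * math.log(2.0 * math.pi * s * s)
                    - 0.5 * ((xi - m) / s) ** 2
                    for xi, m, s in zip(x, mu, sigma))
        out.append(math.exp(log_p))
    return out

def integration_layer(lower, table):
    """F1: out[a] = sum_b lower[b] * table[a][b], which marginalizes the
    lower latent variable b using the conditional table p(b | a)."""
    return [sum(l * t for l, t in zip(lower, row)) for row in table]

def bayes_layer(lik_per_class, y):
    """F0: turn p(x | y = i) into p(y | x) under equal class priors and
    return the per-sample loss -log p(y | x)."""
    total = sum(lik_per_class)
    posterior = [p / total for p in lik_per_class]
    return posterior, -math.log(posterior[y])

def forward(x, y, pi, eta, phi):
    p_x_u = gaussian_layer(x, phi)           # length K
    p_x_z = integration_layer(p_x_u, eta)    # length M: sum_u p(x|u) p(u|z)
    p_x_y = integration_layer(p_x_z, pi)     # length S: sum_z p(x|z) p(z|y)
    return bayes_layer(p_x_y, y)

# Hypothetical toy parameters: S = 2 classes, M = 2 topics, K = 3 words.
pi = [[0.9, 0.1], [0.2, 0.8]]
eta = [[0.6, 0.4, 0.0], [0.0, 0.3, 0.7]]
phi = [([0.0, 0.0], [1.0, 1.0]),
       ([3.0, 3.0], [0.5, 0.5]),
       ([-3.0, 3.0], [1.0, 1.0])]
x = [2.8, 3.1]   # stands in for the network output f_w(v_j)
posterior, loss = forward(x, 1, pi, eta, phi)
```

Because every layer is a differentiable function of its input and parameters, the same chain can be back-propagated through to the network weights.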

36 Toy Data
[Figure: three scatter-plot panels: Input $v$; Features $x$ (Before Backprop); Features $x$ (After Backprop)]
Data with 5 latent clusters drawn from 4 classes
Shape: class label (cross, dot, square, circle)
Color: model prediction
Visualization of the input after the neural network transformation

37 Scene Classification Result
[Table 1 compares pLSA, LDA, neural network, HTM, SVM, and the hybrid model (pre-trained and fully trained); surviving entries: ± 1.2, 65.5 ±, ± 0.6]
Table 1: Classification rates of different methods on the scene classification dataset
