Stochastic Optimization for Deep CCA via Nonlinear Orthogonal Iterations
Weiran Wang, Toyota Technological Institute at Chicago
Joint work with Raman Arora (JHU), Karen Livescu and Nati Srebro (TTIC)
53rd Allerton Conference on Communication, Control, and Computing, September 30, 2015
Multi-view feature learning
Training data consists of samples of a $D$-dimensional random vector that has some natural split into two sub-vectors: $[x; y]$ with $x \in \mathbb{R}^{D_x}$, $y \in \mathbb{R}^{D_y}$, $D_x + D_y = D$.
Natural views: audio+video, audio+articulation, text in different languages...
Abstract/synthetic views: word+context words, different parts of a parse tree...
Task: extract useful features/subspaces in the presence of multiple views that contain complementary information.
Motivations: noise suppression, soft supervision, cross-view retrieval/generation...
We assume feature dimensions have zero mean for notational simplicity.
Canonical correlation analysis (CCA) [Hotelling 1936]
Given: a data set of $N$ paired vectors $\{(x_1, y_1), \dots, (x_N, y_N)\}$, which are samples of random vectors $x \in \mathbb{R}^{D_x}$, $y \in \mathbb{R}^{D_y}$.
Find: direction vectors $(u, v)$ that maximize the correlation
$$(u^*, v^*) = \arg\max_{u,v} \operatorname{corr}(u^\top x, v^\top y) = \arg\max_{u,v} \frac{u^\top \Sigma_{xy} v}{\sqrt{(u^\top \Sigma_{xx} u)(v^\top \Sigma_{yy} v)}}$$
where $\Sigma_{xy} = \sum_{i=1}^{N} x_i y_i^\top$, $\Sigma_{xx} = \sum_{i=1}^{N} x_i x_i^\top$, $\Sigma_{yy} = \sum_{i=1}^{N} y_i y_i^\top$.
Subsequent direction vectors maximize the same correlation, subject to being uncorrelated with previous directions.
Canonical correlation analysis (CCA)
Extracting $L$-dimensional projections $U \in \mathbb{R}^{D_x \times L}$, $V \in \mathbb{R}^{D_y \times L}$:
$$\max_{U,V} \operatorname{tr}\left(U^\top \Sigma_{xy} V\right) \quad \text{s.t.} \quad U^\top \Sigma_{xx} U = V^\top \Sigma_{yy} V = I.$$
Closed-form solution obtained by SVD of $\tilde\Sigma_{xy} = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$.
Alternative formulation: let $X = [x_1, \dots, x_N]$, $Y = [y_1, \dots, y_N]$. Then CCA equivalently solves
$$\min_{U,V} \frac{1}{2} \left\| U^\top X - V^\top Y \right\|_F^2 = \frac{1}{2} \sum_{i=1}^{N} \left\| U^\top x_i - V^\top y_i \right\|^2 \quad \text{s.t.} \quad (U^\top X)(X^\top U) = (V^\top Y)(Y^\top V) = I.$$
But CCA can only find linear subspaces...
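The closed-form solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the talk: the function name `linear_cca`, the small ridge term `reg`, and the synthetic two-view data are all assumptions made for the example.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def linear_cca(X, Y, L, reg=1e-8):
    """Linear CCA via SVD of the whitened cross-covariance.

    X is (Dx, N) and Y is (Dy, N); columns are zero-mean samples.
    Returns U (Dx, L), V (Dy, L) and the top-L canonical correlations.
    """
    # Unnormalized covariances, as on the slides; `reg` is an assumed ridge term.
    Sxx_is = inv_sqrt(X @ X.T + reg * np.eye(X.shape[0]))
    Syy_is = inv_sqrt(Y @ Y.T + reg * np.eye(Y.shape[0]))
    T = Sxx_is @ (X @ Y.T) @ Syy_is            # whitened cross-covariance
    Ut, s, Vh = np.linalg.svd(T)
    U = Sxx_is @ Ut[:, :L]                      # map singular vectors back
    V = Syy_is @ Vh[:L].T
    return U, V, s[:L]

# Synthetic two-view data sharing a 2-dimensional latent signal (hypothetical).
rng = np.random.default_rng(0)
N = 2000
Z = rng.standard_normal((2, N))
X = rng.standard_normal((5, 2)) @ Z + 0.5 * rng.standard_normal((5, N))
Y = rng.standard_normal((4, 2)) @ Z + 0.5 * rng.standard_normal((4, N))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

U, V, corr = linear_cca(X, Y, L=2)
```

The returned projections satisfy the constraints $U^\top \Sigma_{xx} U = V^\top \Sigma_{yy} V = I$ up to the ridge term, and $\operatorname{tr}(U^\top \Sigma_{xy} V)$ equals the sum of the returned correlations.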
Deep CCA [Andrew, Arora, Bilmes and Livescu 2013]
Transform the input of each view nonlinearly with deep neural networks (DNNs), such that the canonical correlation (measured by CCA) between the outputs is maximized.
[Figure: DNN $f$ applied to input $x$ and DNN $g$ applied to input $y$, one per view, followed by the CCA projections $U$ and $V$.]
Final projection: $\tilde f(x) = U^\top f(x)$, $\tilde g(y) = V^\top g(y)$.
A parametric nonlinear extension of CCA that scales better to large data than the kernel extension of CCA [Lai & Fyfe 2000, Akaho 2001, Melzer et al. 2001, Bach & Jordan 2002].
Deep CCA: objective and gradient
Objective over DNN weights $(W_f, W_g)$ and CCA projections $(U, V)$:
$$\max_{W_f, W_g, U, V} \operatorname{tr}\left(U^\top F G^\top V\right) \quad \text{s.t.} \quad U^\top F F^\top U = V^\top G G^\top V = I,$$
where $F = f(X) = [f(x_1), \dots, f(x_N)]$ and $G = g(Y) = [g(y_1), \dots, g(y_N)]$.
Gradient computation [Andrew, Arora, Bilmes and Livescu 2013]: let $\Sigma_{fg} = F G^\top$, $\Sigma_{ff} = F F^\top$, $\Sigma_{gg} = G G^\top$, and let $\tilde\Sigma_{fg} = \Sigma_{ff}^{-1/2} \Sigma_{fg} \Sigma_{gg}^{-1/2} = \tilde U \Lambda \tilde V^\top$ be its SVD. Then
$$\frac{\partial \sum_l \sigma_l(\tilde\Sigma_{fg})}{\partial F} = 2 \Delta_{ff} F + \Delta_{fg} G,$$
where $\Delta_{ff} = -\frac{1}{2} \Sigma_{ff}^{-1/2} \tilde U \Lambda \tilde U^\top \Sigma_{ff}^{-1/2}$ and $\Delta_{fg} = \Sigma_{ff}^{-1/2} \tilde U \tilde V^\top \Sigma_{gg}^{-1/2}$.
Gradients with respect to $(W_f, W_g)$ are computed via back-propagation.
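As a sanity check on the gradient expression, the sketch below evaluates $2\Delta_{ff} F + \Delta_{fg} G$ for random network outputs and compares it with a central finite difference of the total correlation. The function names and the random test matrices are assumptions for illustration; in practice $F$ and $G$ would be DNN outputs.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def total_corr(F, G):
    """Sum of singular values of Sff^{-1/2} Sfg Sgg^{-1/2}."""
    T = inv_sqrt(F @ F.T) @ (F @ G.T) @ inv_sqrt(G @ G.T)
    return np.linalg.svd(T, compute_uv=False).sum()

def grad_wrt_F(F, G):
    """Gradient of total_corr with respect to F: 2 * D_ff @ F + D_fg @ G."""
    Sff_is, Sgg_is = inv_sqrt(F @ F.T), inv_sqrt(G @ G.T)
    T = Sff_is @ (F @ G.T) @ Sgg_is
    Ut, lam, Vh = np.linalg.svd(T, full_matrices=False)
    D_ff = -0.5 * Sff_is @ Ut @ np.diag(lam) @ Ut.T @ Sff_is
    D_fg = Sff_is @ Ut @ Vh @ Sgg_is
    return 2.0 * D_ff @ F + D_fg @ G

# Random stand-ins for the DNN outputs (hypothetical sizes).
rng = np.random.default_rng(0)
L, N = 3, 50
F = rng.standard_normal((L, N))
G = 0.7 * F + rng.standard_normal((L, N))

grad = grad_wrt_F(F, G)

# Directional derivative along a random direction E, two ways.
E = rng.standard_normal((L, N))
eps = 1e-6
numeric = (total_corr(F + eps * E, G) - total_corr(F - eps * E, G)) / (2 * eps)
analytic = np.sum(grad * E)
```

Because the objective is invariant to rescaling $F$, the gradient is orthogonal to $F$ itself, which gives a second cheap check: `np.sum(grad * F)` should be numerically zero.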
Stochastic optimization of deep CCA
$$\max_{W_f, W_g, U, V} \operatorname{tr}\left(U^\top F G^\top V\right) \quad \text{s.t.} \quad U^\top F F^\top U = V^\top G G^\top V = I$$
The objective is not an expectation of a loss over samples, because of the constraints. This makes it more difficult than PCA/PLS [Arora, Cotter, Livescu and Srebro 2012].
Exact gradient computation requires feeding all data forward through the DNNs.
Back-propagation requires large memory for large DNNs and cannot be run on GPUs.
Question: can we do stochastic optimization for deep CCA?
Approach I: Large minibatches (STOL)
Use a minibatch of $n$ samples to estimate $\tilde\Sigma_{fg}$ and the gradient:
$$\hat{\tilde\Sigma}^{(n)}_{fg} = \hat\Sigma_{ff}^{-1/2} \hat\Sigma_{fg} \hat\Sigma_{gg}^{-1/2}$$
Works well for $n$ large enough [Wang, Arora, Livescu and Bilmes 2015].
[Figure: total canonical correlation versus training time (hours) for STOL with n = 100, 200, 300, 400, 500, 750, 1000 and for L-BFGS.]
Does not work for small $n$ because $\mathbb{E}\left[\hat{\tilde\Sigma}^{(n)}_{fg}\right] \neq \tilde\Sigma_{fg}$, due to the nonlinearities in computing $\tilde\Sigma_{fg}$ (matrix inversion, multiplication). Therefore, the gradient estimated on a minibatch is not an unbiased estimate of the true gradient.
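One visible symptom of this bias is that the total correlation measured on a small minibatch systematically overestimates the full-data value. The sketch below illustrates this on synthetic data; the data model, sizes, and number of draws are assumptions chosen for the example and are not taken from the experiments in the talk.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def total_corr(F, G):
    """Sum of singular values of Sff^{-1/2} Sfg Sgg^{-1/2}."""
    T = inv_sqrt(F @ F.T) @ (F @ G.T) @ inv_sqrt(G @ G.T)
    return np.linalg.svd(T, compute_uv=False).sum()

# Two correlated L-dimensional "network outputs" (hypothetical stand-ins).
rng = np.random.default_rng(0)
N, L = 5000, 5
F = rng.standard_normal((L, N))
G = 0.5 * F + rng.standard_normal((L, N))
F -= F.mean(axis=1, keepdims=True)
G -= G.mean(axis=1, keepdims=True)

full = total_corr(F, G)  # objective on all N samples

def mean_minibatch_corr(n, draws=200):
    """Average the minibatch objective over random minibatches of size n."""
    vals = []
    for _ in range(draws):
        idx = rng.choice(N, size=n, replace=False)
        vals.append(total_corr(F[:, idx], G[:, idx]))
    return float(np.mean(vals))

small = mean_minibatch_corr(10)     # small minibatches: strongly inflated
large = mean_minibatch_corr(1000)   # large minibatches: close to `full`
```

Averaging over many small minibatches does not remove the gap between `small` and `full`, which is consistent with the observation that STOL needs large $n$.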
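Alternating least squares for CCA
Alternating least squares [Golub & Zha 1995]: run orthogonal iterations to obtain the singular vectors $(\tilde U, \tilde V)$ of $\tilde\Sigma_{xy}$.
Input: data matrices $X$, $Y$. Initial $\tilde U_0 \in \mathbb{R}^{D_x \times L}$ s.t. $\tilde U_0^\top \tilde U_0 = I$.
$A_0 \leftarrow \tilde U_0^\top \Sigma_{xx}^{-1/2} X$
for $t = 1, 2, \dots, T$ do
  $B_t \leftarrow A_{t-1} Y^\top (Y Y^\top)^{-1} Y$  % least squares regression $Y \to A_{t-1}$
  $B_t \leftarrow (B_t B_t^\top)^{-1/2} B_t$  % orthogonalize $B_t$
  $A_t \leftarrow B_t X^\top (X X^\top)^{-1} X$  % least squares regression $X \to B_t$
  $A_t \leftarrow (A_t A_t^\top)^{-1/2} A_t$  % orthogonalize $A_t$
end for
Output: $A_T \to \tilde U^\top \Sigma_{xx}^{-1/2} X$ and $B_T \to \tilde V^\top \Sigma_{yy}^{-1/2} Y$, the CCA projections, as $T \to \infty$.
This procedure converges linearly under mild conditions.
[Lu & Foster 2014] used a similar procedure for linear CCA with high-dimensional sparse inputs.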
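The alternating least squares procedure can be sketched as follows. The function name `als_cca`, the random initialization via QR, and the synthetic data are assumptions for illustration. Note that with the symmetric orthogonalization used here, the iterates can be expected to recover the canonical subspace only up to a rotation, so the check compares total correlation rather than individual directions.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(w ** -0.5) @ Q.T

def als_cca(X, Y, L, T=50, seed=0):
    """Orthogonal iterations for CCA by alternating least squares."""
    rng = np.random.default_rng(seed)
    U0, _ = np.linalg.qr(rng.standard_normal((X.shape[0], L)))  # U0^T U0 = I
    A = U0.T @ inv_sqrt(X @ X.T) @ X                            # A A^T = I
    for _ in range(T):
        B = (A @ Y.T) @ np.linalg.solve(Y @ Y.T, Y)   # regress A on Y
        B = inv_sqrt(B @ B.T) @ B                     # orthogonalize B
        A = (B @ X.T) @ np.linalg.solve(X @ X.T, X)   # regress B on X
        A = inv_sqrt(A @ A.T) @ A                     # orthogonalize A
    return A, B

# Synthetic two-view data sharing a 2-dimensional latent signal (hypothetical).
rng = np.random.default_rng(0)
N = 2000
Z = rng.standard_normal((2, N))
X = rng.standard_normal((5, 2)) @ Z + 0.5 * rng.standard_normal((5, N))
Y = rng.standard_normal((4, 2)) @ Z + 0.5 * rng.standard_normal((4, N))
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

A, B = als_cca(X, Y, L=2)
als_total = np.linalg.svd(A @ B.T, compute_uv=False).sum()

# Reference value: top-2 singular values of the whitened cross-covariance.
T_xy = inv_sqrt(X @ X.T) @ (X @ Y.T) @ inv_sqrt(Y @ Y.T)
svd_total = np.linalg.svd(T_xy, compute_uv=False)[:2].sum()
```

Each iteration only needs two least squares solves and two $L \times L$ inverse square roots, which is what makes the procedure a natural starting point for a minibatch variant.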
Approach II: Nonlinear orthogonal iterations (NOI)
Choose a minibatch $b$ of $n$ samples at each step, and:
Adaptively estimate the covariance matrix for orthogonalization:
$$\Sigma_{\tilde g \tilde g} \leftarrow \rho \, \Sigma_{\tilde g \tilde g} + (1 - \rho) \frac{N}{n} \sum_{i \in b} \tilde g(y_i) \tilde g(y_i)^\top$$
Time constant $\rho \in [0, 1)$. The update form is similar to that of momentum and is widely used in subspace tracking. Saving $\Sigma_{\tilde g \tilde g} \in \mathbb{R}^{L \times L}$ costs little memory as $L$ is usually small.
Replace exact linear regression with nonlinear least squares and take a gradient descent step on the minibatch:
$$\min_{W_f} \frac{1}{n} \sum_{i \in b} \left\| \tilde f(x_i) - \Sigma_{\tilde g \tilde g}^{-1/2} \tilde g(y_i) \right\|^2$$
This is an ordinary DNN regression problem, with no involved gradient.
Each step feeds forward and back-propagates only $n$ samples!
[Ma, Lu and Foster 2015] proposed a similar algorithm, AppGrad, for CCA with $\rho = 0$.
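The two NOI steps can be sketched for the simplest case of linear "networks" $\tilde f(x) = W_f^\top x$ and $\tilde g(y) = W_g^\top y$. This is a minimal sketch under stated assumptions, not the authors' implementation: the symmetric update of $W_g$ (with a running estimate of $\Sigma_{\tilde f \tilde f}$), the initialization of the running covariances from the first minibatch, the eigenvalue floor, and all hyper-parameter values are choices made for the example.

```python
import numpy as np

def inv_sqrt(S, eps=1e-10):
    """Inverse square root of a symmetric PSD matrix (eigenvalues floored at eps)."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(np.maximum(w, eps) ** -0.5) @ Q.T

def total_corr(F, G):
    """Sum of canonical correlations between the rows of F and G."""
    T = inv_sqrt(F @ F.T) @ (F @ G.T) @ inv_sqrt(G @ G.T)
    return np.linalg.svd(T, compute_uv=False).sum()

def noi_linear(X, Y, L, n=50, rho=0.9, lr=0.02, steps=3000, seed=0):
    """NOI with linear maps; all default values are assumed, not tuned."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    Wf = 0.01 * rng.standard_normal((X.shape[0], L))
    Wg = 0.01 * rng.standard_normal((Y.shape[0], L))
    Sff = Sgg = None
    for _ in range(steps):
        idx = rng.choice(N, size=n, replace=False)
        Xb, Yb = X[:, idx], Y[:, idx]
        Fb, Gb = Wf.T @ Xb, Wg.T @ Yb
        # Running covariance estimates, scaled by N/n as on the slide.
        est_ff, est_gg = (N / n) * Fb @ Fb.T, (N / n) * Gb @ Gb.T
        Sff = est_ff if Sff is None else rho * Sff + (1 - rho) * est_ff
        Sgg = est_gg if Sgg is None else rho * Sgg + (1 - rho) * est_gg
        # Regression targets: whitened outputs of the other view.
        Tf, Tg = inv_sqrt(Sgg) @ Gb, inv_sqrt(Sff) @ Fb
        # One gradient step on (1/n) * sum ||output - target||^2 per view.
        Wf -= lr * (2.0 / n) * Xb @ (Fb - Tf).T
        Wg -= lr * (2.0 / n) * Yb @ (Gb - Tg).T
    return Wf, Wg

def standardize(M):
    M = M - M.mean(axis=1, keepdims=True)
    return M / M.std(axis=1, keepdims=True)

# Noisy synthetic views sharing a 2-dimensional latent signal (hypothetical).
rng = np.random.default_rng(1)
N, D, L = 2000, 20, 2
Z = rng.standard_normal((L, N))
X = standardize(rng.standard_normal((D, L)) @ Z + 2.0 * rng.standard_normal((D, N)))
Y = standardize(rng.standard_normal((D, L)) @ Z + 2.0 * rng.standard_normal((D, N)))

Wf0, Wg0 = noi_linear(X, Y, L, steps=0)   # untrained weights (same seed)
Wf, Wg = noi_linear(X, Y, L)
init = total_corr(Wf0.T @ X, Wg0.T @ Y)
final = total_corr(Wf.T @ X, Wg.T @ Y)
T_xy = inv_sqrt(X @ X.T) @ (X @ Y.T) @ inv_sqrt(Y @ Y.T)
opt = np.linalg.svd(T_xy, compute_uv=False)[:L].sum()   # linear CCA optimum
```

For a DNN, the two weight updates would be replaced by one back-propagation step per view on the same squared-error loss; only the $L \times L$ running covariances need to be kept between steps.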
Experiments: Datasets
Compare three optimizers on two real-world datasets:
Limited-memory BFGS (L-BFGS) run with full-batch gradient
Stochastic optimization with large minibatches (STOL)
Nonlinear Orthogonal Iterations (NOI)
Hyper-parameters (time constant ρ, minibatch size n, learning rate η) are tuned by grid search. STOL/NOI run for a maximum of 50 epochs.
Statistics of the two real-world datasets (training/tuning/test set sizes):
JW11: 30K/11K/9K
MNIST: 50K/10K/10K
Experiments: Effect of minibatch size n
[Figure: learning curves (canonical correlation versus epoch) on the JW11 and MNIST tuning sets for L-BFGS (n = N), STOL, and NOI with different minibatch sizes n.]
NOI works well with various small minibatch sizes.
NOI gives steep improvement in the first few passes over the data.
Experiments: Effect of time constant ρ
[Figure: total canonical correlation achieved by NOI on the JW11 and MNIST tuning sets as a function of ρ, one curve per minibatch size n.]
NOI works well for a wide range of ρ.
For small n, it is beneficial to use a large ρ to incorporate the previous estimate of the covariance.
Experiments: Pure stochastic optimization
[Figure: total canonical correlation achieved on the MNIST training set at different ρ for linear CCA, comparing random initialization, SVD, STOL (n = 500), and NOI.]
AppGrad [Ma, Lu and Foster 2015] used n of order O(L) with ρ = 0.
Pure stochastic optimization (n = 1) works well with large ρ.
Conclusions
We have developed NOI for training deep CCA with small minibatches, alleviating the memory cost.
NOI performs competitively with previous batch and stochastic optimizers.
Future directions: the gradients of the nonlinear least squares problems in NOI are not unbiased estimates of the gradient of the deep CCA objective. We need to better understand the convergence properties of NOI.
Thank you!
Bibliography
Shotaro Akaho. A kernel method for canonical correlation analysis. In Proceedings of the International Meeting of the Psychometric Society (IMPS2001), Osaka, Japan, 2001. Springer-Verlag.
Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In Sanjoy Dasgupta and David McAllester, editors, Proc. of the 30th Int. Conf. Machine Learning (ICML 2013), Atlanta, GA, June 2013.
Raman Arora, Andy Cotter, Karen Livescu, and Nati Srebro. Stochastic optimization for PCA and PLS. In 50th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, October 2012.
Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, July 2002.
Gene H. Golub and Hongyuan Zha. The canonical correlations of matrix pairs and their numerical computation. In Linear Algebra for Signal Processing, volume 69 of The IMA Volumes in Mathematics and its Applications. Springer-Verlag, 1995.
Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4), December 1936.
P. L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst., 10(5), October 2000.
Zhuang Ma, Yichao Lu, and Dean Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In Francis Bach and David Blei, editors, Proc. of the 32nd Int. Conf. Machine Learning (ICML 2015), Lille, France, July 2015.
Thomas Melzer, Michael Reiter, and Horst Bischof. Nonlinear feature extraction using generalized canonical correlation analysis. In G. Dorffner, H. Bischof, and K. Hornik, editors, Proc. of the 11th Int. Conf. Artificial Neural Networks (ICANN 01), Vienna, Austria, August 2001.
Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. Unsupervised learning of acoustic features via deep canonical correlation analysis. In Proc. of the IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP 15), Brisbane, Australia, April 2015.
CSCI567 Machine Learning (Fall 24) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu October 2, 24 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 24) October 2, 24 / 24 Outline Review
More informationLinear Models for Regression. Sargur Srihari
Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood
More informationMachine Learning. Boris
Machine Learning Boris Nadion boris@astrails.com @borisnadion @borisnadion boris@astrails.com astrails http://astrails.com awesome web and mobile apps since 2005 terms AI (artificial intelligence)
More informationImproving L-BFGS Initialization for Trust-Region Methods in Deep Learning
Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning Jacob Rafati http://rafati.net jrafatiheravi@ucmerced.edu Ph.D. Candidate, Electrical Engineering and Computer Science University
More informationArtificial Neural Networks. MGS Lecture 2
Artificial Neural Networks MGS 2018 - Lecture 2 OVERVIEW Biological Neural Networks Cell Topology: Input, Output, and Hidden Layers Functional description Cost functions Training ANNs Back-Propagation
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationThe Geometry Of Kernel Canonical Correlation Analysis
Max Planck Institut für biologische Kybernetik Max Planck Institute for Biological Cybernetics Technical Report No. 108 The Geometry Of Kernel Canonical Correlation Analysis Malte Kuss 1 and Thore Graepel
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationA two-layer ICA-like model estimated by Score Matching
A two-layer ICA-like model estimated by Score Matching Urs Köster and Aapo Hyvärinen University of Helsinki and Helsinki Institute for Information Technology Abstract. Capturing regularities in high-dimensional
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationImmediate Reward Reinforcement Learning for Projective Kernel Methods
ESANN'27 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), 25-27 April 27, d-side publi., ISBN 2-9337-7-2. Immediate Reward Reinforcement Learning for Projective Kernel Methods
More informationReward-modulated inference
Buck Shlegeris Matthew Alger COMP3740, 2014 Outline Supervised, unsupervised, and reinforcement learning Neural nets RMI Results with RMI Types of machine learning supervised unsupervised reinforcement
More informationThe role of dimensionality reduction in classification
The role of dimensionality reduction in classification Weiran Wang and Miguel Á. Carreira-Perpiñán Electrical Engineering and Computer Science University of California, Merced http://eecs.ucmerced.edu
More information1 Introduction Independent component analysis (ICA) [10] is a statistical technique whose main applications are blind source separation, blind deconvo
The Fixed-Point Algorithm and Maximum Likelihood Estimation for Independent Component Analysis Aapo Hyvarinen Helsinki University of Technology Laboratory of Computer and Information Science P.O.Box 5400,
More informationNeural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science
Neural Networks Prof. Dr. Rudolf Kruse Computational Intelligence Group Faculty for Computer Science kruse@iws.cs.uni-magdeburg.de Rudolf Kruse Neural Networks 1 Supervised Learning / Support Vector Machines
More informationDemystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
T. BEN-NUN, T. HOEFLER Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis https://www.arxiv.org/abs/1802.09941 What is Deep Learning good for? Digit Recognition Object
More informationA Spectral Regularization Framework for Multi-Task Structure Learning
A Spectral Regularization Framework for Multi-Task Structure Learning Massimiliano Pontil Department of Computer Science University College London (Joint work with A. Argyriou, T. Evgeniou, C.A. Micchelli,
More informationCharacterization of Gradient Dominance and Regularity Conditions for Neural Networks
Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Yi Zhou Ohio State University Yingbin Liang Ohio State University Abstract zhou.1172@osu.edu liang.889@osu.edu The past
More informationLinear and Non-Linear Dimensionality Reduction
Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections
More informationIntroduction to Neural Networks
Introduction to Neural Networks Philipp Koehn 4 April 205 Linear Models We used before weighted linear combination of feature values h j and weights λ j score(λ, d i ) = j λ j h j (d i ) Such models can
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More information