Random Matrix Theory for Neural Networks


1 Random Matrix Theory for Neural Networks — Ph.D. Mid-Term Evaluation. Zhenyu Liao, Laboratoire des Signaux et Systèmes (L2S), CentraleSupélec, Université Paris-Saclay. Salle sd.207, Bâtiment Bouygues, Gif-sur-Yvette, France, 2018.

2 Outline: 1. Curriculum Vitae; 2. Introduction; 3. Summary of main results; 4. Conclusion.

3 Curriculum Vitae
Education:
Ph.D. in Statistics and Signal Processing, L2S, CentraleSupélec, France, 2016-present. Thesis: Random Matrix Theory for Neural Networks. Supervisors: Prof. Romain Couillet, Prof. Yacine Chitour.
M.Sc. in Signal and Image Processing, CentraleSupélec/Paris-Sud, France.
B.Sc. in Electronic Engineering, Paris-Sud, France.
B.Sc. in Optical & Electronic Information, HUST, Wuhan, China.
Ph.D. training:
Scientific training [completed]: Random matrix theory and machine learning applications: 27 hours. Summer school of signal and image processing in Peyresq: 2 hours.
Professional training [completed]: Law and intellectual property: 8 hours. European projects Horizon 2020: 8 hours. Techniques for scientific writing and associated software: 5 hours.
Teaching: Lab work on Signals and Systems, with Prof. Laurent Le Brusquet, Department of Signal and Statistics, CentraleSupélec: 54 hours.

4 Curriculum Vitae
Review activities: IEEE Transactions on Signal Processing; Neural Processing Letters.
Publications
Conferences:
Z. Liao, R. Couillet, The Dynamics of Learning: A Random Matrix Approach, (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
Z. Liao, R. Couillet, On the Spectrum of Random Features Maps of High Dimensional Data, (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
Z. Liao, R. Couillet, Une Analyse des Méthodes de Projections Aléatoires par la Théorie des Matrices Aléatoires (in French), Colloque GRETSI 2017, Juan-les-Pins, France, 2017.
Z. Liao, R. Couillet, Random Matrices Meet Machine Learning: A Large Dimensional Analysis of LS-SVM, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, USA, 2017.
Journals:
C. Louart, Z. Liao, R. Couillet, A Random Matrix Approach to Neural Networks, (in press) Annals of Applied Probability, 2017.
Z. Liao, R. Couillet, A Large Dimensional Analysis of Least Squares Support Vector Machines, (submitted to) Journal of Machine Learning Research, 2017.

5 Motivation: features in machine learning
Learning = Representation + Evaluation + Optimization [Domingos, 2012]. Features: representations of the data that contain the crucial information. Various methods for feature extraction: feature selection by hand; features learned via backpropagation; random feature maps. How to study and understand these features?
Sample covariance matrix (SCM) (1/T) X X^T of some data X = [x_1, ..., x_T] ∈ R^{p×T}. SCM in feature space ↔ feature Gram matrix G: G ≡ (1/T) F^T F with F = [f(x_1), ..., f(x_T)] the feature matrix of the data X = [x_1, ..., x_T].
Domingos, Pedro. A few useful things to know about machine learning. Communications of the ACM 55.10 (2012): 78-87.

6 Motivation: in random feature maps (RFM)
SCM in feature space ↔ feature Gram matrix G: G ≡ (1/T) F^T F with F ∈ R^{n×T} the feature matrix of the data X. In RFM, G determines the training and test performance via its resolvent Q(z) ≡ (G − z I_T)^{−1}.
Example 1: MSE of RFM-based ridge regression (also called extreme learning machines):
E_train = (1/T) ‖y − β^T F‖_F^2 = (γ²/T) y^T Q²(−γ) y,  E_test = (1/T̂) ‖ŷ − β^T F̂‖_F^2,
with ridge regressor β ≡ (1/T) F (G + γ I_T)^{−1} y^T = (1/T) F Q(−γ) y^T and regularization γ > 0; y is the target associated with the training data X and ŷ the target of the test data X̂.
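To make these objects concrete, here is a minimal numpy sketch of random-feature ridge regression returning E_train and E_test. The function name rfm_ridge, the ReLU features with i.i.d. Gaussian W, and the toy two-class data are illustrative assumptions, not the exact setup of the slides.

```python
import numpy as np

def rfm_ridge(X, y, X_test, y_test, n=512, gamma=0.1, seed=0):
    """Random-feature ridge regression sketch: returns (E_train, E_test).

    X: (p, T) training data with targets y (T,); X_test, y_test analogous.
    Uses ReLU features sigma(Wx) with i.i.d. Gaussian W (an illustrative choice)."""
    rng = np.random.default_rng(seed)
    p, T = X.shape
    W = rng.standard_normal((n, p))                # random weights W in R^{n x p}
    F = np.maximum(W @ X, 0)                       # feature matrix F in R^{n x T}
    F_test = np.maximum(W @ X_test, 0)
    G = F.T @ F / T                                # feature Gram matrix G = F^T F / T
    Q = np.linalg.inv(G + gamma * np.eye(T))       # resolvent Q(-gamma)
    beta = F @ Q @ y / T                           # ridge regressor beta = F Q(-gamma) y / T
    E_train = np.mean((y - F.T @ beta) ** 2)
    E_test = np.mean((y_test - F_test.T @ beta) ** 2)
    return E_train, E_test

# toy two-class usage (Gaussian blobs), just to exercise the function
rng = np.random.default_rng(1)
p, T = 64, 256
y = np.concatenate([-np.ones(T // 2), np.ones(T // 2)])
X = rng.standard_normal((p, T)) / np.sqrt(p) + np.outer(np.ones(p), y) / np.sqrt(p)
X_test = rng.standard_normal((p, T)) / np.sqrt(p) + np.outer(np.ones(p), y) / np.sqrt(p)
print(rfm_ridge(X, y, X_test, y, n=512, gamma=0.1))
```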

7 Motivation: gradient descent dynamics
Example 2: given a feature matrix F with associated labels y, a classifier trained by gradient descent (GD) minimizes L(w) = (1/2T) ‖y − w^T F‖². With a constant small learning rate α, the GD dynamics yield
∂w(t)/∂t = −α ∇_w L = (α/T) F (y − F^T w(t))  ⇒  w(t) = e^{−(αt/T) F F^T} w_0 + (I_p − e^{−(αt/T) F F^T}) (F F^T)^{−1} F y.
The generalization performance on a new feature f̂ is related to
w(t)^T f̂ = −(1/2πi) ∮_γ f_t(z) f̂^T Q̃(z) w_0 dz − (1/2πi) ∮_γ ((1 − f_t(z))/(T z)) f̂^T Q̃(z) F y dz,
with f_t(z) ≡ exp(−αtz), Q̃ the resolvent of G̃ ≡ (1/T) F F^T, and γ a (positively oriented) contour enclosing all the eigenvalues of G̃.
Important remark: in both examples, the analysis reduces to the study of the eigenspectrum, through the resolvent, of SCM-like matrices (G or G̃), especially in today's high dimensional regime where n, p, T are comparably large. Random Matrix Theory (RMT) is the answer!
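As a quick sanity check of this closed form, the following numpy sketch compares explicit GD iterates with the continuous-time expression for w(t); the synthetic features, the choice w_0 = 0, and the step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, alpha, steps = 30, 100, 0.01, 2000

F = rng.standard_normal((p, T))              # feature matrix F in R^{p x T}
y = np.sign(rng.standard_normal(T))          # +/-1 labels
w = np.zeros(p)                              # start from w_0 = 0

# explicit gradient descent on L(w) = ||y - F^T w||^2 / (2T)
for _ in range(steps):
    w += alpha / T * F @ (y - F.T @ w)

# closed-form continuous-time solution at time t = steps (with w_0 = 0)
lam, V = np.linalg.eigh(F @ F.T)             # FF^T = V diag(lam) V^T
E = V @ np.diag(np.exp(-alpha * steps / T * lam)) @ V.T
w_closed = (np.eye(p) - E) @ np.linalg.solve(F @ F.T, F @ y)

print(np.linalg.norm(w - w_closed) / np.linalg.norm(w_closed))   # small for small alpha
```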

8 Motivation: difficulty from nonlinearity
However, the highly nonlinear and abstract nature of real-world problems means that classical RMT results cannot be applied directly; new tools are needed to adapt to nonlinear models.
Example: random feature maps, taking the data x ∈ R^p to the features σ(W x) ∈ R^n.
Figure: illustration of random feature maps, with random matrix W ∈ R^{n×p} with i.i.d. entries and entry-wise (nonlinear) activation function σ(·).
Classical RMT results are essentially based on the trace lemma: for A ∈ R^{n×n} of bounded operator norm and a random vector w ∈ R^n with i.i.d. entries,
(1/n) w^T A w − (1/n) tr(A) → 0  almost surely as n → ∞.
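A small numerical illustration of the trace lemma; the sizes and the choice A = randn/√n (to keep the operator norm bounded) are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 500, 2000):
    A = rng.standard_normal((n, n)) / np.sqrt(n)   # operator norm bounded (about 2)
    w = rng.standard_normal(n)                      # i.i.d. entries
    print(n, abs(w @ A @ w / n - np.trace(A) / n))  # shrinks roughly like 1/sqrt(n)
```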

9 Motivation: difficulty from nonlinearity
Example: random feature maps, x ∈ R^p ↦ σ(W x) ∈ R^n (see the figure above).
Classical RMT results are essentially based on the trace lemma: for A ∈ R^{n×n} of bounded operator norm and a random vector w ∈ R^n with i.i.d. entries,
(1/n) w^T A w − (1/n) tr(A) → 0  almost surely as n → ∞.
However, the object under study here is (1/n) σ(w^T X) A σ(X^T w): the independence between entries is lost and the behavior is more elusive due to σ(·).
One major objective of the thesis: handle nonlinearity in RMT.

10 Handle nonlinearity in RMT: concentration
Lemma (concentration of quadratic forms). For w ~ N(0, I_p), ‖X‖ and ‖A‖ bounded and σ(·) Lipschitz, as n, p → ∞ with p/n → c,
P( | (1/n) σ(w^T X) A σ(X^T w) − (1/n) tr(Φ A) | > t ) ≤ C e^{−c n min(t, t²)},
where Φ ≡ E_w[ σ(X^T w) σ(w^T X) ]. Entry-wise,
Φ(a, b) ≡ E_w[ σ(w^T a) σ(w^T b) ] = (2π)^{−p/2} ∫_{R^p} σ(w^T a) σ(w^T b) e^{−‖w‖²/2} dw = (1/2π) ∫_{R²} σ(w̃^T ã) σ(w̃^T b̃) e^{−‖w̃‖²/2} dw̃  (projection onto span(a, b)).
For σ(t) = max(t, 0) ≡ ReLU(t),
Φ(a, b) = (1/2π) ∫_S w̃^T ã · w̃^T b̃ · e^{−‖w̃‖²/2} dw̃ = (1/2π) ‖a‖ ‖b‖ ( ∠ arccos(−∠) + √(1 − ∠²) ),
where S ≡ { w̃ : min(w̃^T ã, w̃^T b̃) > 0 } and ∠ ≡ a^T b / (‖a‖ ‖b‖).
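The ReLU expression for Φ(a, b) can be checked by Monte Carlo; in this sketch the dimension, sample count and random choice of a, b are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 20, 200_000
a, b = rng.standard_normal(p), rng.standard_normal(p)

# Monte-Carlo estimate of Phi(a, b) = E_w[ relu(w^T a) relu(w^T b) ], w ~ N(0, I_p)
W = rng.standard_normal((N, p))
mc = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))

# closed form for the ReLU activation
na, nb = np.linalg.norm(a), np.linalg.norm(b)
ang = a @ b / (na * nb)
closed = na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang ** 2))

print(mc, closed)   # the two values agree up to Monte-Carlo error
```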

11 Handle nonlinearity in RMT: concentration
Behavior of Φ as a function of X: a concentration phenomenon for large n, p, T. For the ReLU function,
Φ(x_i, x_j) = (1/2π) ‖x_i‖ ‖x_j‖ ( √(1 − ∠(x_i, x_j)²) + ∠(x_i, x_j) arccos(−∠(x_i, x_j)) ).
Assumption (GMM data): for i = 1, ..., T, x_i ~ N(μ_a/√p, C_a/p), i.e. x_i = μ_a/√p + ω_i with ω_i ~ N(0, C_a/p); class C_a has cardinality T_a, for a = 1, ..., K. Then
x_i^T x_j = (μ_a/√p + ω_i)^T (μ_b/√p + ω_j) = μ_a^T μ_b/p + (μ_a^T ω_j + μ_b^T ω_i)/√p + ω_i^T ω_j
x_i^T x_i = (μ_a/√p + ω_i)^T (μ_a/√p + ω_i) = ‖μ_a‖²/p + 2 μ_a^T ω_i/√p + ‖ω_i‖².
Growth rates [Neyman-Pearson minimal]: as n, p, T → ∞, let C° ≡ Σ_{a=1}^K (T_a/T) C_a and, for a = 1, ..., K, C_a° ≡ C_a − C°; the Euclidean norm ‖μ_a‖ = O(1), the operator norm ‖C_a‖ = O(1), and tr(C_a°) = O(√p).

12 Handle nonlinearity in RMT: concentration
Growth rates [Neyman-Pearson minimal]: as n, p, T → ∞, let C° ≡ Σ_{a=1}^K (T_a/T) C_a and, for a = 1, ..., K, C_a° ≡ C_a − C°; then ‖μ_a‖ = O(1), ‖C_a‖ = O(1) and tr(C_a°) = O(√p). Under these rates,
x_i^T x_j = μ_a^T μ_b/p + (μ_a^T ω_j + μ_b^T ω_i)/√p + ω_i^T ω_j, where the first two terms are O(p^{−1}) and the last is O(p^{−1/2});
x_i^T x_i = ‖μ_a‖²/p + 2 μ_a^T ω_i/√p + ( ‖ω_i‖² − tr(C_a)/p ) + tr(C_a°)/p + τ = τ + O(p^{−1/2}),
with E‖ω_i‖² = tr(C_a)/p and τ ≡ tr(C°)/p. Hence g(x_i^T x_i) concentrates around g(τ) ⇒ Taylor expansion.
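A numerical illustration of this concentration, for a hypothetical two-class model chosen to satisfy the growth rates (the choice C_a = c_a I_p, the means and the constants below are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, K = 2000, 400, 2

# hypothetical two-class model satisfying the growth rates:
# ||mu_a|| = O(1), C_a = c_a * I_p with ||C_a|| = O(1) and tr(C_a - C°) = O(sqrt(p))
mu = np.stack([2 * np.ones(p), -2 * np.ones(p)]) / np.sqrt(p)   # ||mu_a|| = 2
c = 1.0 + np.array([+2.0, -2.0]) / np.sqrt(p)                   # traces differ at order sqrt(p)
labels = rng.integers(0, K, T)

Omega = rng.standard_normal((p, T)) * np.sqrt(c[labels] / p)    # omega_i ~ N(0, c_a I_p / p)
X = mu[labels].T / np.sqrt(p) + Omega                           # x_i = mu_a / sqrt(p) + omega_i

tau = np.mean(c[labels])                                        # tau = tr(C°)/p since C_a = c_a I_p
gram = X.T @ X
off_diag = gram - np.diag(np.diag(gram))
print("max |x_i^T x_i - tau| :", np.abs(np.diag(gram) - tau).max())   # vanishes as p grows
print("max |x_i^T x_j|, i != j:", np.abs(off_diag).max())             # vanishes as p grows
```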

13 Main results
Theorem (asymptotic equivalent of Φ_c). Recenter Φ ≡ E_w[ σ(X^T w) σ(w^T X) ] to get Φ_c ≡ P Φ P, with P ≡ I_T − (1/T) 1_T 1_T^T. As T → ∞, ‖Φ_c − Φ̃_c‖ → 0 almost surely, with Φ̃_c ≡ P Φ̃ P and
Φ̃ ≡ d_1 ( Ω + M J^T/√p )^T ( Ω + M J^T/√p ) + d_2 U B U^T + d_0 I_T,
as well as U ≡ [ J/√p, φ ] and B ≡ [[ t t^T + 2S, t ], [ t^T, 1 ]], where Ω ≡ [ω_1, ..., ω_T], φ ≡ { ‖ω_i‖² − E[‖ω_i‖²] }_{i=1}^T, and the data statistics² are M ≡ [μ_1, ..., μ_K], t ≡ { tr C_a°/√p }_{a=1}^K, S ≡ { tr(C_a C_b)/p }_{a,b=1}^K, J ≡ [ j_1, ..., j_K ], with j_a ∈ R^T the canonical vector of class C_a such that (j_a)_i = δ_{x_i ∈ C_a}.
² M for the means, t for the (differences in) traces, and S for the shapes of the covariances.

14 Coefficients d_i for different activation functions
Table: Φ(a, b) for different σ(·), with ∠ ≡ a^T b / (‖a‖ ‖b‖).
σ(t) = t:  Φ(a, b) = a^T b
σ(t) = max(t, 0) ≡ ReLU(t):  Φ(a, b) = (1/2π) ‖a‖ ‖b‖ ( ∠ arccos(−∠) + √(1 − ∠²) )
σ(t) = |t|:  Φ(a, b) = (2/π) ‖a‖ ‖b‖ ( ∠ arcsin(∠) + √(1 − ∠²) )
σ(t) = 1_{t>0}:  Φ(a, b) = 1/2 − (1/2π) arccos(∠)
σ(t) = sign(t):  Φ(a, b) = (2/π) arcsin(∠)
σ(t) = ς_2 t² + ς_1 t + ς_0:  Φ(a, b) = ς_2² ( 2 (a^T b)² + ‖a‖² ‖b‖² ) + ς_1² a^T b + ς_2 ς_0 ( ‖a‖² + ‖b‖² ) + ς_0²
σ(t) = cos(t):  Φ(a, b) = exp( −(‖a‖² + ‖b‖²)/2 ) cosh(a^T b)
σ(t) = sin(t):  Φ(a, b) = exp( −(‖a‖² + ‖b‖²)/2 ) sinh(a^T b)
σ(t) = erf(t):  Φ(a, b) = (2/π) arcsin( 2 a^T b / √( (1 + 2‖a‖²)(1 + 2‖b‖²) ) )
σ(t) = exp(−t²/2):  Φ(a, b) = 1 / √( (1 + ‖a‖²)(1 + ‖b‖²) − (a^T b)² )
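Two of the tabulated expressions (the cos and exp(−t²/2) rows) checked by Monte Carlo; the dimensions and the scaling of a, b below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 30, 200_000
a, b = rng.standard_normal(p) / np.sqrt(p), rng.standard_normal(p) / np.sqrt(p)
na2, nb2, ab = a @ a, b @ b, a @ b

W = rng.standard_normal((N, p))
u, v = W @ a, W @ b

# sigma(t) = cos(t): Phi(a, b) = exp(-(||a||^2 + ||b||^2)/2) cosh(a^T b)
print(np.mean(np.cos(u) * np.cos(v)), np.exp(-(na2 + nb2) / 2) * np.cosh(ab))

# sigma(t) = exp(-t^2/2): Phi(a, b) = 1 / sqrt((1 + ||a||^2)(1 + ||b||^2) - (a^T b)^2)
print(np.mean(np.exp(-u ** 2 / 2) * np.exp(-v ** 2 / 2)),
      1 / np.sqrt((1 + na2) * (1 + nb2) - ab ** 2))
```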

15 Coefficients d_i for different activation functions
Table: coefficients d_i in Φ_c for different σ(·).
σ(t) = t:  d_0 = 0,  d_1 = 1,  d_2 = 0
σ(t) = max(t, 0) ≡ ReLU(t):  d_0 = (1/4 − 1/2π) τ,  d_1 = 1/4,  d_2 = 1/(8πτ)
σ(t) = |t|:  d_0 = (1 − 2/π) τ,  d_1 = 0,  d_2 = 1/(2πτ)
σ(t) = 1_{t>0}:  d_0 = 1/4 − 1/(2π),  d_1 = 1/(2πτ),  d_2 = 0
σ(t) = sign(t):  d_0 = 1 − 2/π,  d_1 = 2/(πτ),  d_2 = 0
σ(t) = ς_2 t² + ς_1 t + ς_0:  d_0 = 2τ² ς_2²,  d_1 = ς_1²,  d_2 = ς_2²
σ(t) = cos(t):  d_0 = 1/2 + e^{−2τ}/2 − e^{−τ},  d_1 = 0,  d_2 = e^{−τ}/4
σ(t) = sin(t):  d_0 = 1/2 − e^{−2τ}/2 − τ e^{−τ},  d_1 = e^{−τ},  d_2 = 0
σ(t) = erf(t):  d_0 = (2/π) ( arcsin( 2τ/(2τ+1) ) − 2τ/(2τ+1) ),  d_1 = 4/(π(2τ+1)),  d_2 = 0
σ(t) = exp(−t²/2):  d_0 = 1/√(2τ+1) − 1/(τ+1),  d_1 = 0,  d_2 = 1/(4(τ+1)³)
Three types of activation functions σ(·):
mean-oriented, d_1 ≠ 0 while d_2 = 0: t, 1_{t>0}, sign(t), sin(t) and erf(t);
covariance-oriented, d_1 = 0 while d_2 ≠ 0: |t|, cos(t), exp(−t²/2);
balanced, d_1, d_2 ≠ 0: the ReLU function and the quadratic function.
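A tiny helper evaluating the tabulated coefficients for a few activations and reporting their type; the function names and the value τ = 1 are illustrative choices.

```python
import numpy as np

def relu_coeffs(tau):
    """d_0, d_1, d_2 for sigma(t) = max(t, 0), taken from the table above."""
    return (0.25 - 1 / (2 * np.pi)) * tau, 0.25, 1 / (8 * np.pi * tau)

def sign_coeffs(tau):
    """d_0, d_1, d_2 for sigma(t) = sign(t)."""
    return 1 - 2 / np.pi, 2 / (np.pi * tau), 0.0

def abs_coeffs(tau):
    """d_0, d_1, d_2 for sigma(t) = |t|."""
    return (1 - 2 / np.pi) * tau, 0.0, 1 / (2 * np.pi * tau)

tau = 1.0   # tau = tr(C°)/p, to be estimated from the data in practice
for name, f in [("ReLU", relu_coeffs), ("sign", sign_coeffs), ("abs", abs_coeffs)]:
    d0, d1, d2 = f(tau)
    kind = "balanced" if d1 and d2 else ("mean-oriented" if d1 else "covariance-oriented")
    print(f"{name:5s} d0={d0:.4f} d1={d1:.4f} d2={d2:.4f} ({kind})")
```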

16 Applications
Example 1: MSE of the random feature-based ridge regression:
E_train = (1/T) ‖y − β^T F‖_F^2 = (γ²/T) y^T Q²(−γ) y,  E_test = (1/T̂) ‖ŷ − β^T F̂‖_F^2,
with the ridge regressor β ≡ (1/T) F (G + γ I_T)^{−1} y^T = (1/T) F Q(−γ) y^T.
Figure: E_train and E_test (theory versus simulation) as a function of γ, for σ(t) = t, σ(t) = erf(t) and σ(t) = max(t, 0), on MNIST data (numbers 7 and 9), n = 512, T = T̂ = 1024, p = 784.

17 Applications
Example 2: random feature-based spectral clustering.
Figure: eigenvalue distributions of Φ_c and Φ̃_c, and the leading eigenvector for MNIST data, with ±1 standard deviation (simulation: mean/std for MNIST data; theory: mean/std for Gaussian data; 500 trials), for the two classes C_1 and C_2. With the ReLU function, p = 784, T = 128 and c_1 = c_2 = 1/2.

18 Applications
Example 2: random feature-based spectral clustering.
Table: empirical estimation of the (normalized) statistics M^T M and t t^T + 2S of the MNIST and epileptic EEG datasets.
Table: classification accuracies of random feature-based spectral clustering with different σ(t) on the MNIST dataset.
σ(t)            T = 64     T = 128
t               88.94%     87.30%    (mean-oriented)
1_{t>0}                    85.56%    (mean-oriented)
sign(t)         83.34%     85.22%    (mean-oriented)
sin(t)          87.8%      87.50%    (mean-oriented)
erf(t)          87.28%     86.59%    (mean-oriented)
|t|             60.4%      57.8%     (covariance-oriented)
cos(t)          59.56%     57.72%    (covariance-oriented)
exp(−t²/2)      60.44%     58.67%    (covariance-oriented)
ReLU(t)         85.72%     82.27%    (balanced)
Table: classification accuracies of random feature-based spectral clustering with different σ(t) on the epileptic EEG dataset.
σ(t)            T = 64     T = 128
t               70.3%      69.58%    (mean-oriented)
1_{t>0}                    63.47%    (mean-oriented)
sign(t)         64.63%     63.03%    (mean-oriented)
sin(t)          70.34%     68.22%    (mean-oriented)
erf(t)          70.59%     67.70%    (mean-oriented)
|t|             99.69%     99.50%    (covariance-oriented)
cos(t)          99.38%     99.36%    (covariance-oriented)
exp(−t²/2)      99.8%      99.77%    (covariance-oriented)
ReLU(t)         87.9%      90.97%    (balanced)
In both cases, the activation family matched to the data structure performs much better than ReLU.
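A sketch of random feature-based spectral clustering in the two-class case; the procedure used here (sign of the leading eigenvector of the centered feature Gram matrix) and the synthetic mean-separated data are simplifying assumptions, not necessarily the exact pipeline behind the tables above.

```python
import numpy as np

def rf_spectral_clustering(X, n_features=1024, sigma=np.sign, seed=0):
    """Two-class random feature-based spectral clustering sketch.

    X: (p, T) data. Returns +/-1 labels from the sign of the leading eigenvector
    of the centered feature Gram matrix (a simple heuristic)."""
    rng = np.random.default_rng(seed)
    p, T = X.shape
    W = rng.standard_normal((n_features, p))
    F = sigma(W @ X)                           # random features, n x T
    G = F.T @ F / T                            # feature Gram matrix
    P = np.eye(T) - np.ones((T, T)) / T        # centering projector
    _, vecs = np.linalg.eigh(P @ G @ P)
    return np.sign(vecs[:, -1])                # sign of the leading eigenvector

# toy usage: two classes separated in their means, with a mean-oriented activation
rng = np.random.default_rng(1)
p, T = 256, 256
m = 1.5 * np.ones(p) / np.sqrt(p)
X = np.concatenate([rng.standard_normal((p, T // 2)) / np.sqrt(p) + m[:, None],
                    rng.standard_normal((p, T // 2)) / np.sqrt(p) - m[:, None]], axis=1)
labels = rf_spectral_clustering(X, sigma=np.sign)
acc = max(np.mean(labels[:T // 2] > 0) + np.mean(labels[T // 2:] < 0),
          np.mean(labels[:T // 2] < 0) + np.mean(labels[T // 2:] > 0)) / 2
print("clustering accuracy:", acc)
```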

19 Applications
Example 3: temporal evolution of the training and generalization performance.
Figure: training and generalization misclassification rates as a function of the epoch t, theory versus simulation, for MNIST data (numbers 1 and 7) with n = p = 784, c_1 = c_2 = 1/2, α = 0.01 and σ² = 0.1. Simulation results obtained by averaging over 100 runs.
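A minimal simulation of this temporal evolution on synthetic two-class Gaussian data; the data model (a stand-in for the MNIST 1-vs-7 experiment), the epoch grid and the initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, alpha, sigma2 = 784, 512, 0.01, 0.1

# synthetic two-class Gaussian data with means +/- mu and noise variance sigma2
mu = np.ones(p) / np.sqrt(p)
y = np.concatenate([-np.ones(T // 2), np.ones(T // 2)])
X = np.sqrt(sigma2) * rng.standard_normal((p, T)) + np.outer(mu, y)
X_test = np.sqrt(sigma2) * rng.standard_normal((p, T)) + np.outer(mu, y)

w = rng.standard_normal(p) / np.sqrt(p)       # random initialization w_0
for epoch in range(1, 201):
    w += alpha / T * X @ (y - X.T @ w)        # GD step on L(w) = ||y - X^T w||^2 / (2T)
    if epoch % 50 == 0:
        err_train = np.mean(np.sign(X.T @ w) != y)
        err_test = np.mean(np.sign(X_test.T @ w) != y)
        print(f"epoch {epoch:3d}  train error {err_train:.3f}  test error {err_test:.3f}")
```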

20 Conclusion
Take-away messages:
SCMs or Gram-like matrices naturally appear in neural networks;
RMT is a powerful tool in the double-asymptotic regime;
the difficulty arising from the nonlinear structure is handled with the concentration approach;
performance analyses and improvements of different algorithms: fast tuning of hyper-parameters, wise choice of the activation vis-à-vis the data structure, more insights into the training of neural nets.
Future work:
deeper study of inner-product kernels and activations in neural nets (relate σ(·) to d_1, d_2);
the loss landscape of deep networks (beyond the single-layer setting).

21 References
On random feature maps and neural nets:
C. Louart, Z. Liao, R. Couillet, A Random Matrix Approach to Neural Networks, (in press) Annals of Applied Probability, 2017.
Z. Liao, R. Couillet, On the Spectrum of Random Features Maps of High Dimensional Data, (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
On the learning dynamics of neural nets:
Z. Liao, R. Couillet, The Dynamics of Learning: A Random Matrix Approach, (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
A journal paper in preparation with co-supervisor Prof. Yacine Chitour.
On kernel methods:
Z. Liao, R. Couillet, Random Matrices Meet Machine Learning: A Large Dimensional Analysis of LS-SVM, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, USA, 2017.
Z. Liao, R. Couillet, A Large Dimensional Analysis of Least Squares Support Vector Machines, (submitted to) Journal of Machine Learning Research, 2017.

22 Thank you!
