Random Matrix Theory for Neural Networks
1 Random Matrix Theory for Neural Networks
Ph.D. Mid-Term Evaluation
Zhenyu Liao
Laboratoire des Signaux et Systèmes (L2S), CentraleSupélec, Université Paris-Saclay
Salle sd.207, Bâtiment Bouygues, Gif-sur-Yvette, France
Z. Liao (L2S, CentraleSupélec), RMT for Neural Networks, Ph.D. Mid-Term Evaluation, 2018, 1 / 26
2 Outline
1 Curriculum Vitae
2 Introduction
3 Summary of main results
4 Conclusion
3 Curriculum Vitae
Education
- Ph.D. in Statistics and Signal Processing, L2S, CentraleSupélec, France, 2016-present. Thesis: Random Matrix Theory for Neural Networks. Supervisors: Prof. Romain Couillet, Prof. Yacine Chitour.
- M.Sc. in Signal and Image Processing, CentraleSupélec/Université Paris-Sud, France.
- B.Sc. in Electronic Engineering, Université Paris-Sud, France.
- B.Sc. in Optical & Electronic Information, HUST, Wuhan, China.
Ph.D. training
- Scientific training [completed]: Random matrix theory and machine learning applications: 27 hours. Summer school of signal and image processing in Peyresq: 2 hours.
- Professional training [completed]: Law and intellectual property: 8 hours. European projects Horizon 2020: 8 hours. Techniques for scientific writing and associated software: 5 hours.
Teaching
- Lab work of Signals and Systems, with Prof. Laurent Le Brusquet, Department of Signal and Statistics, CentraleSupélec: 54 hours.
4 Curriculum Vitae
Review activities
- IEEE Transactions on Signal Processing
- Neural Processing Letters
Publications
Conferences
- Z. Liao, R. Couillet, "The Dynamics of Learning: A Random Matrix Approach," (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
- Z. Liao, R. Couillet, "On the Spectrum of Random Features Maps of High Dimensional Data," (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
- Z. Liao, R. Couillet, "Une Analyse des Méthodes de Projections Aléatoires par la Théorie des Matrices Aléatoires" (in French; "An Analysis of Random Projection Methods via Random Matrix Theory"), Colloque GRETSI 2017, Juan-les-Pins, France, 2017.
- Z. Liao, R. Couillet, "Random Matrices Meet Machine Learning: A Large Dimensional Analysis of LS-SVM," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, USA, 2017.
Journals
- C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks," (in press) Annals of Applied Probability, 2017.
- Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines," (submitted to) Journal of Machine Learning Research, 2017.
5 Motivation: features in machine learning
Learning = Representation + Evaluation + Optimization.
Features: representations of the data that contain crucial information.
Various methods for feature extraction:
- features selected by hand
- features learned via backpropagation
- random feature maps
How to study and understand these features?
Sample Covariance Matrix (SCM) $\frac{1}{T} X X^T$ of some data $X = [x_1, \ldots, x_T] \in \mathbb{R}^{p \times T}$.
SCM in feature space → feature Gram matrix $G$: $G \equiv \frac{1}{T} F^T F$, with $F = [f(x_1), \ldots, f(x_T)]$ the feature matrix of the data $X = [x_1, \ldots, x_T]$.
Domingos, Pedro. "A few useful things to know about machine learning." Communications of the ACM 55.10 (2012).
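In code, the SCM and the feature Gram matrix are each one line. A minimal numpy sketch, where the dimensions, the Gaussian data, and the ReLU random feature map are illustrative choices rather than those of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, T = 50, 200, 100                        # data dim, feature dim, sample count

X = rng.standard_normal((p, T))               # data matrix X = [x_1, ..., x_T]
W = rng.standard_normal((n, p)) / np.sqrt(p)  # random projection of the feature map
F = np.maximum(W @ X, 0.0)                    # feature matrix F = [f(x_1), ..., f(x_T)]

SCM = X @ X.T / T                             # sample covariance matrix (1/T) X X^T
G = F.T @ F / T                               # feature Gram matrix G = (1/T) F^T F
```

Both matrices are symmetric positive semi-definite; $G$ is the object whose eigenspectrum the rest of the talk studies.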
6 Motivation: in random feature maps (RFM)
SCM in feature space → feature Gram matrix $G$: $G \equiv \frac{1}{T} F^T F$, with $F \in \mathbb{R}^{n \times T}$ the feature matrix of the data $X$.
In RFM, $G$ determines training and test performance via its resolvent $Q(z) \equiv (G - z I_T)^{-1}$.
Example 1: MSE of RFM-based ridge regression (also called extreme learning machines):
$E_{\rm train} = \frac{1}{T} \| y - \beta^T F \|_F^2 = \frac{\gamma^2}{T} y^T Q^2(-\gamma) y, \qquad E_{\rm test} = \frac{1}{\hat T} \| \hat y - \beta^T \hat F \|_F^2$
with ridge regressor $\beta \equiv \frac{1}{T} F (G + \gamma I_T)^{-1} y = \frac{1}{T} F Q(-\gamma) y$ and regularization $\gamma > 0$; $y$ denotes the targets associated with the training data $X$, and $\hat y$ the targets of the test data $\hat X$.
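The identity $E_{\rm train} = \frac{\gamma^2}{T} y^T Q^2(-\gamma) y$ follows from $y - F^T \beta = \gamma (G + \gamma I_T)^{-1} y$ and is easy to check numerically. A hedged numpy sketch (all sizes, the ReLU feature map, and the $\pm 1$ targets are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, T, gamma = 30, 100, 80, 1e-1

X = rng.standard_normal((p, T))
W = rng.standard_normal((n, p)) / np.sqrt(p)
F = np.maximum(W @ X, 0.0)                    # ReLU random features, F in R^{n x T}
y = rng.choice([-1.0, 1.0], size=T)           # +-1 training targets

G = F.T @ F / T                               # feature Gram matrix
Q = np.linalg.inv(G + gamma * np.eye(T))      # resolvent Q(-gamma)

beta = F @ Q @ y / T                          # ridge regressor beta
E_train_direct = np.sum((y - F.T @ beta) ** 2) / T
E_train_resolvent = gamma**2 * (y @ Q @ Q @ y) / T

print(E_train_direct, E_train_resolvent)      # the two expressions coincide
```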
7 Motivation: gradient descent dynamics
Example 2: for a feature matrix $F$ with associated labels $y$, a classifier trained by GD minimizes $L(w) = \frac{1}{2T} \| y^T - w^T F \|^2$. With a constant small learning rate $\alpha$, the GD dynamics yields
$\frac{\partial w(t)}{\partial t} = -\alpha \nabla_w L = \frac{\alpha}{T} F \left( y - F^T w(t) \right)$
$\Rightarrow \quad w(t) = e^{-\frac{\alpha t}{T} F F^T} w_0 + \left( I_p - e^{-\frac{\alpha t}{T} F F^T} \right) (F F^T)^{-1} F y.$
The generalization performance on a new feature $\hat f$ is then related to
$w(t)^T \hat f = -\frac{1}{2\pi i} \oint_\gamma f_t(z) \, \hat f^T Q(z) w_0 \, dz - \frac{1}{2\pi i} \oint_\gamma \frac{1 - f_t(z)}{z} \, \frac{\hat f^T Q(z) F y}{T} \, dz$
with $f_t(z) \equiv \exp(-\alpha t z)$, $Q$ the resolvent of $\tilde G \equiv \frac{1}{T} F F^T$, and $\gamma$ a contour enclosing all eigenvalues of $\tilde G$.
Important remark
Both examples reduce to the study of the eigenspectrum, through the resolvent, of SCM-like matrices ($G$ or $\tilde G$), especially in the high dimensional regime of today where $n$, $p$, $T$ are comparably large.
Random Matrix Theory (RMT) is the answer!
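The closed-form gradient-flow solution above can be checked against an Euler discretization of GD. A sketch under illustrative assumptions (dimensions, step size, and horizon are arbitrary; since $\frac{1}{T} F F^T$ is symmetric, `np.linalg.eigh` is used in place of a generic matrix exponential):

```python
import numpy as np

rng = np.random.default_rng(2)
p, T, alpha, t_end = 20, 100, 0.01, 50.0

F = rng.standard_normal((p, T))               # feature matrix F in R^{p x T}
y = rng.choice([-1.0, 1.0], size=T)           # labels
w0 = rng.standard_normal(p)                   # initialization

K = F @ F.T / T                               # (1/T) F F^T, symmetric PSD
lam, U = np.linalg.eigh(K)
E = U @ np.diag(np.exp(-alpha * t_end * lam)) @ U.T   # e^{-(alpha t / T) F F^T}
w_inf = np.linalg.solve(F @ F.T, F @ y)       # least-squares limit (F F^T)^{-1} F y
w_closed = E @ w0 + (np.eye(p) - E) @ w_inf   # closed-form gradient-flow solution

# Euler discretization of the gradient flow with a small step dt
w, dt = w0.copy(), 1e-2
for _ in range(int(t_end / dt)):
    w += dt * alpha / T * F @ (y - F.T @ w)
```

The discretized trajectory lands (up to $O(dt)$ error) on the closed-form $w(t)$.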
8 Motivation: difficulty from nonlinearity
However, the highly nonlinear and abstract nature of real-world problems means that:
- classical RMT results cannot apply directly
- new tools are needed to adapt to nonlinear models
Example: random feature maps.
Figure: Illustration of random feature maps: data $x \in \mathbb{R}^p$ is mapped, through a random matrix $W \in \mathbb{R}^{n \times p}$ with i.i.d. entries and a (nonlinear) activation function $\sigma(\cdot)$ applied entry-wise, to features $\sigma(Wx) \in \mathbb{R}^n$.
Classical RMT results are essentially based on the trace lemma: for $A \in \mathbb{R}^{n \times n}$ of bounded operator norm and a random vector $w \in \mathbb{R}^n$ with i.i.d. entries,
$\frac{1}{n} w^T A w - \frac{1}{n} \operatorname{tr}(A) \to 0$ almost surely as $n \to \infty$.
9 Motivation: difficulty from nonlinearity
Example: random feature maps: data $x \in \mathbb{R}^p$ mapped through $W \in \mathbb{R}^{n \times p}$ and an entry-wise $\sigma(\cdot)$ to features $\sigma(Wx) \in \mathbb{R}^n$.
Classical RMT results are essentially based on the trace lemma: for $A \in \mathbb{R}^{n \times n}$ of bounded operator norm and a random vector $w \in \mathbb{R}^n$ with i.i.d. entries, $\frac{1}{n} w^T A w - \frac{1}{n} \operatorname{tr}(A) \to 0$ almost surely as $n \to \infty$.
However, the object under study here is $\frac{1}{n} \sigma(w^T X) A \sigma(X^T w)$:
- loss of independence between entries
- behavior more elusive due to $\sigma(\cdot)$
One major objective of the thesis: handle nonlinearity in RMT.
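The trace lemma is easy to observe numerically: the deviation of $\frac{1}{n} w^T A w$ from $\frac{1}{n} \operatorname{tr}(A)$ shrinks as $n$ grows. A sketch using a diagonal $A$ of bounded operator norm (an illustrative choice that keeps the quadratic form cheap to evaluate):

```python
import numpy as np

rng = np.random.default_rng(3)
deviations = {}
for n in (100, 1000, 10000):
    a = rng.uniform(0.0, 2.0, size=n)   # spectrum of a diagonal A, operator norm <= 2
    w = rng.standard_normal(n)          # i.i.d. entries, zero mean, unit variance
    quad = (a * w) @ w / n              # (1/n) w^T A w for diagonal A
    tr = a.sum() / n                    # (1/n) tr(A)
    deviations[n] = abs(quad - tr)
    print(f"n={n:6d}  |quad - trace| = {deviations[n]:.4f}")
```

The deviation decays at the expected $O(1/\sqrt{n})$ rate.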
10 Handle nonlinearity in RMT: concentration
Lemma (Concentration of quadratic forms)
For $w \sim \mathcal{N}(0, I_p)$, $X$, $A$ bounded and $\sigma(\cdot)$ Lipschitz, as $n, p \to \infty$ with $p/n \to c$,
$P\left( \left| \frac{1}{n} \sigma(w^T X) A \sigma(X^T w) - \frac{1}{n} \operatorname{tr}(\Phi A) \right| > t \right) \le C e^{-c n \min(t, t^2)}$
where $\Phi \equiv E_w\left[ \sigma(X^T w) \sigma(w^T X) \right]$.
$\Phi(a, b) \equiv E_w\left[ \sigma(w^T a) \sigma(w^T b) \right] = (2\pi)^{-\frac{p}{2}} \int_{\mathbb{R}^p} \sigma(w^T a) \sigma(w^T b) e^{-\frac{1}{2}\|w\|^2} dw = \frac{1}{2\pi} \int_{\mathbb{R}^2} \sigma(\tilde w^T \tilde a) \sigma(\tilde w^T \tilde b) e^{-\frac{1}{2}\|\tilde w\|^2} d\tilde w$ (projection on $\operatorname{span}(a, b)$).
For $\sigma(t) = \max(t, 0) \equiv \operatorname{ReLU}(t)$,
$\Phi(a, b) = \frac{1}{2\pi} \int_S \tilde w^T \tilde a \, \tilde w^T \tilde b \, e^{-\frac{1}{2}\|\tilde w\|^2} d\tilde w = \frac{1}{2\pi} \|a\| \|b\| \left( \angle(a,b) \arccos(-\angle(a,b)) + \sqrt{1 - \angle(a,b)^2} \right)$
where $S \equiv \left\{ \tilde w : \min(\tilde w^T \tilde a, \tilde w^T \tilde b) > 0 \right\}$ and $\angle(a,b) \equiv \frac{a^T b}{\|a\| \|b\|}$.
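The ReLU expression for $\Phi(a, b)$ can be verified by Monte Carlo integration over $w \sim \mathcal{N}(0, I_p)$. A sketch where the dimension, the two vectors, and the sample count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 0.5, 1.0])
p = a.size

# Monte Carlo estimate of Phi(a, b) = E_w[ReLU(w^T a) ReLU(w^T b)], w ~ N(0, I_p)
W = rng.standard_normal((1_000_000, p))
mc = np.mean(np.maximum(W @ a, 0.0) * np.maximum(W @ b, 0.0))

# closed form for sigma(t) = max(t, 0)
na, nb = np.linalg.norm(a), np.linalg.norm(b)
ang = a @ b / (na * nb)                       # angle(a, b) = a^T b / (||a|| ||b||)
closed = na * nb / (2 * np.pi) * (ang * np.arccos(-ang) + np.sqrt(1 - ang**2))

print(mc, closed)                             # agree to Monte Carlo accuracy
```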
11 Handle nonlinearity in RMT: concentration
Behavior of $\Phi$ as a function of $X$: concentration phenomenon for large $n, p, T$.
For the ReLU function, $\Phi(x_i, x_j) = \frac{1}{2\pi} \|x_i\| \|x_j\| \left( \sqrt{1 - \angle(x_i, x_j)^2} + \angle(x_i, x_j) \arccos(-\angle(x_i, x_j)) \right)$.
Assumption: GMM data
For $i = 1, \ldots, T$, $x_i \sim \mathcal{N}_p\left( \frac{\mu_a}{\sqrt{p}}, \frac{C_a}{p} \right)$, i.e., $x_i = \frac{\mu_a}{\sqrt{p}} + \omega_i$ with $\omega_i \sim \mathcal{N}\left(0, \frac{C_a}{p}\right)$; class $\mathcal{C}_a$ has cardinality $T_a$, for $a = 1, \ldots, K$.
$x_i^T x_j = \left( \frac{\mu_a^T}{\sqrt{p}} + \omega_i^T \right) \left( \frac{\mu_b}{\sqrt{p}} + \omega_j \right) = \frac{\mu_a^T \mu_b}{p} + \frac{\mu_a^T \omega_j + \mu_b^T \omega_i}{\sqrt{p}} + \omega_i^T \omega_j$
$x_i^T x_i = \left( \frac{\mu_a^T}{\sqrt{p}} + \omega_i^T \right) \left( \frac{\mu_a}{\sqrt{p}} + \omega_i \right) = \frac{\|\mu_a\|^2}{p} + \frac{2 \mu_a^T \omega_i}{\sqrt{p}} + \|\omega_i\|^2$
Growth rate [Neyman–Pearson minimal]
As $n, p, T \to \infty$, let $C^\circ \equiv \sum_{a=1}^K \frac{T_a}{T} C_a$ and, for $a = 1, \ldots, K$, $C_a^\circ \equiv C_a - C^\circ$:
- the Euclidean norm $\|\mu_a\| = O(1)$
- the operator norm $\|C_a\| = O(1)$ and $\operatorname{tr}(C_a^\circ) = O(\sqrt{p})$
12 Handle nonlinearity in RMT: concentration
Growth rate [Neyman–Pearson minimal]: as $n, p, T \to \infty$, with $C^\circ \equiv \sum_{a=1}^K \frac{T_a}{T} C_a$ and $C_a^\circ \equiv C_a - C^\circ$ for $a = 1, \ldots, K$: $\|\mu_a\| = O(1)$, $\|C_a\| = O(1)$ and $\operatorname{tr}(C_a^\circ) = O(\sqrt{p})$.
Then
$x_i^T x_j = \left( \frac{\mu_a}{\sqrt{p}} + \omega_i \right)^T \left( \frac{\mu_b}{\sqrt{p}} + \omega_j \right) = \underbrace{\frac{\mu_a^T \mu_b}{p}}_{O(p^{-1})} + \underbrace{\frac{\mu_a^T \omega_j + \mu_b^T \omega_i}{\sqrt{p}}}_{O(p^{-1})} + \underbrace{\omega_i^T \omega_j}_{O(p^{-1/2})}$
$x_i^T x_i = \underbrace{\frac{\|\mu_a\|^2}{p} + \frac{2 \mu_a^T \omega_i}{\sqrt{p}}}_{O(p^{-1})} + \underbrace{\|\omega_i\|^2 - \frac{\operatorname{tr}(C_a)}{p}}_{O(p^{-1/2})} + \underbrace{\frac{\operatorname{tr}(C_a^\circ)}{p}}_{O(p^{-1/2})} + \underbrace{\tau}_{O(1)}$
with $E\left[ \|\omega_i\|^2 \right] = \operatorname{tr}(C_a)/p$ and $\tau \equiv \operatorname{tr}(C^\circ)/p$.
⇒ $x_i^T x_i$ concentrates around $\tau$, so $g(x_i^T x_i)$ concentrates around $g(\tau)$: Taylor expansion.
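Under this growth rate, the concentration of $x_i^T x_i$ around $\tau$ is directly observable. A sketch with a single class, whose mean and covariance profile are illustrative choices satisfying the assumptions ($\|\mu\| = 1$, covariance spectrum bounded around 1):

```python
import numpy as np

rng = np.random.default_rng(5)
max_dev = {}
for p in (100, 400, 1600):
    T = 2 * p
    mu = np.ones(p) / np.sqrt(p)                     # Euclidean norm ||mu|| = 1 = O(1)
    spec = 1.0 + rng.uniform(-1, 1, p) / np.sqrt(p)  # spectrum of C, ||C|| = O(1)
    tau = spec.sum() / p                             # tau = tr(C)/p
    # x_i = mu / sqrt(p) + omega_i, with omega_i ~ N(0, C/p), C diagonal here
    X = mu[None, :] / np.sqrt(p) + rng.standard_normal((T, p)) * np.sqrt(spec / p)
    max_dev[p] = np.abs(np.sum(X**2, axis=1) - tau).max()
    print(f"p={p:5d}  max_i |x_i^T x_i - tau| = {max_dev[p]:.4f}")
```

Even the worst deviation over all $T = 2p$ samples shrinks as $p$ grows, which is what licenses the Taylor expansion of $g$ around $\tau$.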
13 Main results
Theorem (Asymptotic equivalent of $\Phi_c$)
Recenter $\Phi \equiv E_w\left[ \sigma(X^T w) \sigma(w^T X) \right]$ to get $\Phi_c \equiv P \Phi P$, with $P \equiv I_T - \frac{1}{T} 1_T 1_T^T$. As $T \to \infty$, $\| \Phi_c - \tilde\Phi_c \| \to 0$ almost surely, with $\tilde\Phi_c \equiv P \tilde\Phi P$ and
$\tilde\Phi \equiv d_1 \left( \Omega + M \frac{J^T}{\sqrt{p}} \right)^T \left( \Omega + M \frac{J^T}{\sqrt{p}} \right) + d_2 U B U^T + d_0 I_T$
as well as $U \equiv \left[ \frac{J}{\sqrt{p}}, \phi \right]$, $B \equiv \begin{bmatrix} t t^T + 2S & t \\ t^T & 1 \end{bmatrix}$, where $\Omega \equiv [\omega_1, \ldots, \omega_T]$, $\phi \equiv \left\{ \|\omega_i\|^2 - E\left[ \|\omega_i\|^2 \right] \right\}_{i=1}^T$, and the data statistics² are $M \equiv [\mu_1, \ldots, \mu_K]$, $t \equiv \left\{ \operatorname{tr}(C_a^\circ)/\sqrt{p} \right\}_{a=1}^K$, $S \equiv \left\{ \operatorname{tr}(C_a C_b)/p \right\}_{a,b=1}^K$, $J \equiv [j_1, \ldots, j_K]$, where $j_a \in \mathbb{R}^T$ denotes the canonical vector of class $\mathcal{C}_a$ such that $(j_a)_i = \delta_{x_i \in \mathcal{C}_a}$.
² M for the means, t for the (differences in) traces, and S for the shapes of the covariances.
14 Coefficients $d_i$ for different activation functions
Table: $\Phi(a, b)$ for different $\sigma(\cdot)$, with $\angle(a, b) \equiv \frac{a^T b}{\|a\| \|b\|}$.
- $\sigma(t) = t$: $\Phi(a, b) = a^T b$
- $\sigma(t) = \max(t, 0) \equiv \operatorname{ReLU}(t)$: $\Phi(a, b) = \frac{1}{2\pi} \|a\| \|b\| \left( \angle(a,b) \arccos(-\angle(a,b)) + \sqrt{1 - \angle(a,b)^2} \right)$
- $\sigma(t) = |t|$: $\Phi(a, b) = \frac{2}{\pi} \|a\| \|b\| \left( \angle(a,b) \arcsin(\angle(a,b)) + \sqrt{1 - \angle(a,b)^2} \right)$
- $\sigma(t) = 1_{t > 0}$: $\Phi(a, b) = \frac{1}{2} - \frac{1}{2\pi} \arccos(\angle(a,b))$
- $\sigma(t) = \operatorname{sign}(t)$: $\Phi(a, b) = \frac{2}{\pi} \arcsin(\angle(a,b))$
- $\sigma(t) = \varsigma_2 t^2 + \varsigma_1 t + \varsigma_0$: $\Phi(a, b) = \varsigma_2^2 \left( 2 (a^T b)^2 + \|a\|^2 \|b\|^2 \right) + \varsigma_1^2 a^T b + \varsigma_2 \varsigma_0 \left( \|a\|^2 + \|b\|^2 \right) + \varsigma_0^2$
- $\sigma(t) = \cos(t)$: $\Phi(a, b) = \exp\left( -\frac{1}{2} \left( \|a\|^2 + \|b\|^2 \right) \right) \cosh(a^T b)$
- $\sigma(t) = \sin(t)$: $\Phi(a, b) = \exp\left( -\frac{1}{2} \left( \|a\|^2 + \|b\|^2 \right) \right) \sinh(a^T b)$
- $\sigma(t) = \operatorname{erf}(t)$: $\Phi(a, b) = \frac{2}{\pi} \arcsin\left( \frac{2 a^T b}{\sqrt{(1 + 2\|a\|^2)(1 + 2\|b\|^2)}} \right)$
- $\sigma(t) = \exp(-\frac{t^2}{2})$: $\Phi(a, b) = \frac{1}{\sqrt{(1 + \|a\|^2)(1 + \|b\|^2) - (a^T b)^2}}$
15 Coefficients $d_i$ for different activation functions
Table: Coefficients $d_i$ in $\tilde\Phi$ for different $\sigma(\cdot)$, in the order $(d_0, d_1, d_2)$.
- $\sigma(t) = t$: $\left( 0, \; 1, \; 0 \right)$
- $\sigma(t) = \max(t, 0) \equiv \operatorname{ReLU}(t)$: $\left( \left( \frac{1}{4} - \frac{1}{2\pi} \right) \tau, \; \frac{1}{4}, \; \frac{1}{8\pi\tau} \right)$
- $\sigma(t) = |t|$: $\left( \left( 1 - \frac{2}{\pi} \right) \tau, \; 0, \; \frac{1}{2\pi\tau} \right)$
- $\sigma(t) = 1_{t > 0}$: $\left( \frac{1}{4} - \frac{1}{2\pi}, \; \frac{1}{2\pi\tau}, \; 0 \right)$
- $\sigma(t) = \operatorname{sign}(t)$: $\left( 1 - \frac{2}{\pi}, \; \frac{2}{\pi\tau}, \; 0 \right)$
- $\sigma(t) = \varsigma_2 t^2 + \varsigma_1 t + \varsigma_0$: $\left( 2\tau^2 \varsigma_2^2, \; \varsigma_1^2, \; \varsigma_2^2 \right)$
- $\sigma(t) = \cos(t)$: $\left( \frac{1}{2} + \frac{e^{-2\tau}}{2} - e^{-\tau}, \; 0, \; \frac{e^{-\tau}}{4} \right)$
- $\sigma(t) = \sin(t)$: $\left( \frac{1 - e^{-2\tau}}{2} - \tau e^{-\tau}, \; e^{-\tau}, \; 0 \right)$
- $\sigma(t) = \operatorname{erf}(t)$: $\left( \frac{2}{\pi} \left( \arcsin\frac{2\tau}{2\tau + 1} - \frac{2\tau}{2\tau + 1} \right), \; \frac{4}{\pi(2\tau + 1)}, \; 0 \right)$
- $\sigma(t) = \exp(-\frac{t^2}{2})$: $\left( \frac{1}{\sqrt{2\tau + 1}} - \frac{1}{\tau + 1}, \; 0, \; \frac{1}{4(\tau + 1)^3} \right)$
Three types of activation function $\sigma(\cdot)$:
- mean-oriented, $d_1 \neq 0$ while $d_2 = 0$: $t$, $1_{t > 0}$, $\operatorname{sign}(t)$, $\sin(t)$ and $\operatorname{erf}(t)$
- covariance-oriented, $d_1 = 0$ while $d_2 \neq 0$: $|t|$, $\cos(t)$, $\exp(-t^2/2)$
- balanced, $d_1, d_2 \neq 0$: the ReLU function and the quadratic function
16 Applications
Example 1: MSE of random feature-based ridge regression:
$E_{\rm train} = \frac{1}{T} \| y - \beta^T F \|_F^2 = \frac{\gamma^2}{T} y^T Q^2(-\gamma) y, \qquad E_{\rm test} = \frac{1}{\hat T} \| \hat y - \beta^T \hat F \|_F^2$
with the ridge regressor $\beta \equiv \frac{1}{T} F (G + \gamma I_T)^{-1} y = \frac{1}{T} F Q(-\gamma) y$.
Figure: Training and test MSE as a function of $\gamma$ (theory vs. simulation) for $\sigma(t) = t$, $\sigma(t) = \operatorname{erf}(t)$ and $\sigma(t) = \max(t, 0)$, on MNIST data (digits 7 and 9), $n = 512$, $T = \hat T = 1024$, $p = 784$.
17 Applications
Example 2: Random feature-based spectral clustering.
Figure: Eigenvalue distributions of $\Phi_c$ and $\tilde\Phi_c$, and the leading eigenvector for MNIST data (classes $\mathcal{C}_1$ and $\mathcal{C}_2$), with ±1 standard deviation (from 500 trials): simulation mean/std for MNIST data vs. theory mean/std for Gaussian data. With the ReLU function, $p = 784$, $T = 128$ and $c_1 = c_2 = 1/2$.
18 Applications
Example 2: Random feature-based spectral clustering.
Table: Empirical estimation of the (normalized) statistics $\| M^T M \|$ and $\| t t^T + 2S \|$ of the MNIST and epileptic EEG datasets.
Table: Classification accuracies for random feature-based spectral clustering with different $\sigma(t)$ on the MNIST dataset.
σ(t)            | T = 64  | T = 128
mean-oriented:
t               | 88.94%  | 87.30%
1_{t>0}         |         | 85.56%
sign(t)         | 83.34%  | 85.22%
sin(t)          | 87.8%   | 87.50%
erf(t)          | 87.28%  | 86.59%
covariance-oriented:
|t|             | 60.4%   | 57.8%
cos(t)          | 59.56%  | 57.72%
exp(−t²/2)      | 60.44%  | 58.67%
balanced:
ReLU(t)         | 85.72%  | 82.27%
Table: Classification accuracies for random feature-based spectral clustering with different $\sigma(t)$ on the epileptic EEG dataset.
σ(t)            | T = 64  | T = 128
mean-oriented:
t               | 70.3%   | 69.58%
1_{t>0}         |         | 63.47%
sign(t)         | 64.63%  | 63.03%
sin(t)          | 70.34%  | 68.22%
erf(t)          | 70.59%  | 67.70%
covariance-oriented:
|t|             | 99.69%  | 99.50%
cos(t)          | 99.38%  | 99.36%
exp(−t²/2)      | 99.8%   | 99.77%
balanced:
ReLU(t)         | 87.9%   | 90.97%
The family matched to the data (mean-oriented on MNIST, covariance-oriented on EEG) performs much better than ReLU in both cases.
19 Applications
Example 3: Temporal evolution of training and generalization performance.
Figure: Misclassification rate vs. epoch $t$: training and generalization performance (theory vs. simulation) for MNIST data (digits 1 and 7) with $n = p = 784$, $c_1 = c_2 = 1/2$, $\alpha = 0.01$ and $\sigma^2 = 0.1$. Simulation results obtained by averaging over 100 runs.
20 Conclusion
Take-away messages:
- SCMs or Gram-like matrices naturally appear in neural nets
- RMT is a powerful tool in the double-asymptotic regime
- the difficulty arising from the nonlinear structure is handled with the concentration approach
- performance analyses and improvements of different algorithms: fast tuning of hyper-parameters, wise choice of activation vis-à-vis the data structure, more insights into the training of neural nets
Future work:
- deeper study of inner-product kernels and activations in neural nets (relating $\sigma(\cdot)$ to $d_1$, $d_2$)
- the loss landscape of deep networks (beyond single-layer)
21 References
On random feature maps and neural nets:
- C. Louart, Z. Liao, R. Couillet, "A Random Matrix Approach to Neural Networks," (in press) Annals of Applied Probability, 2017.
- Z. Liao, R. Couillet, "On the Spectrum of Random Features Maps of High Dimensional Data," (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
On the learning dynamics of neural nets:
- Z. Liao, R. Couillet, "The Dynamics of Learning: A Random Matrix Approach," (submitted to) The 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 2018.
- a journal article in preparation with co-supervisor Prof. Yacine Chitour
On kernel methods:
- Z. Liao, R. Couillet, "Random Matrices Meet Machine Learning: A Large Dimensional Analysis of LS-SVM," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017), New Orleans, USA, 2017.
- Z. Liao, R. Couillet, "A Large Dimensional Analysis of Least Squares Support Vector Machines," (submitted to) Journal of Machine Learning Research, 2017.
22 Merci!
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationKernel Methods. Machine Learning A W VO
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationLecture 10: A brief introduction to Support Vector Machine
Lecture 10: A brief introduction to Support Vector Machine Advanced Applied Multivariate Analysis STAT 2221, Fall 2013 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationKriging models with Gaussian processes - covariance function estimation and impact of spatial sampling
Kriging models with Gaussian processes - covariance function estimation and impact of spatial sampling François Bachoc former PhD advisor: Josselin Garnier former CEA advisor: Jean-Marc Martinez Department
More informationAdaptive Gradient Methods AdaGrad / Adam
Case Study 1: Estimating Click Probabilities Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 The Problem with GD (and SGD)
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationESANN'2001 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2001, D-Facto public., ISBN ,
Sparse Kernel Canonical Correlation Analysis Lili Tan and Colin Fyfe 2, Λ. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong. 2. School of Information and Communication
More informationSlide credit from Hung-Yi Lee & Richard Socher
Slide credit from Hung-Yi Lee & Richard Socher 1 Review Recurrent Neural Network 2 Recurrent Neural Network Idea: condition the neural network on all previous words and tie the weights at each time step
More informationFitting Linear Statistical Models to Data by Least Squares I: Introduction
Fitting Linear Statistical Models to Data by Least Squares I: Introduction Brian R. Hunt and C. David Levermore University of Maryland, College Park Math 420: Mathematical Modeling January 25, 2012 version
More informationBrief Introduction to Machine Learning
Brief Introduction to Machine Learning Yuh-Jye Lee Lab of Data Science and Machine Intelligence Dept. of Applied Math. at NCTU August 29, 2016 1 / 49 1 Introduction 2 Binary Classification 3 Support Vector
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Matrix Notation Mark Schmidt University of British Columbia Winter 2017 Admin Auditting/registration forms: Submit them at end of class, pick them up end of next class. I need
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationSupport Vector Machines.
Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationc 4, < y 2, 1 0, otherwise,
Fundamentals of Big Data Analytics Univ.-Prof. Dr. rer. nat. Rudolf Mathar Problem. Probability theory: The outcome of an experiment is described by three events A, B and C. The probabilities Pr(A) =,
More informationSupport Vector Machines
Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)
More informationA summary of Deep Learning without Poor Local Minima
A summary of Deep Learning without Poor Local Minima by Kenji Kawaguchi MIT oral presentation at NIPS 2016 Learning Supervised (or Predictive) learning Learn a mapping from inputs x to outputs y, given
More informationFeedforward Neural Networks. Michael Collins, Columbia University
Feedforward Neural Networks Michael Collins, Columbia University Recap: Log-linear Models A log-linear model takes the following form: p(y x; v) = exp (v f(x, y)) y Y exp (v f(x, y )) f(x, y) is the representation
More informationA GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong
A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,
More informationEE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
More informationLarge sample covariance matrices and the T 2 statistic
Large sample covariance matrices and the T 2 statistic EURANDOM, the Netherlands Joint work with W. Zhou Outline 1 2 Basic setting Let {X ij }, i, j =, be i.i.d. r.v. Write n s j = (X 1j,, X pj ) T and
More informationDiscussion About Nonlinear Time Series Prediction Using Least Squares Support Vector Machine
Commun. Theor. Phys. (Beijing, China) 43 (2005) pp. 1056 1060 c International Academic Publishers Vol. 43, No. 6, June 15, 2005 Discussion About Nonlinear Time Series Prediction Using Least Squares Support
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework Due Oct 15, 10.30 am Rules Please follow these guidelines. Failure to do so, will result in loss of credit. 1. Homework is due on the due date
More informationIntroduction to (Convolutional) Neural Networks
Introduction to (Convolutional) Neural Networks Philipp Grohs Summer School DL and Vis, Sept 2018 Syllabus 1 Motivation and Definition 2 Universal Approximation 3 Backpropagation 4 Stochastic Gradient
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More information