Weighted Principal Support Vector Machines for Sufficient Dimension Reduction in Binary Classification
1 Weighted Principal Support Vector Machines for Sufficient Dimension Reduction in Binary Classification. A joint work with Seung Jun Shin, Hao Helen Zhang, and Yufeng Liu.
2 Outline
3 Outline
4 Sufficient Dimension Reduction. For a given pair (Y, X) ∈ R × R^p, sufficient dimension reduction (SDR) seeks a matrix B = (b_1, ..., b_d) ∈ R^{p×d} which satisfies

    Y ⊥ X | B'X.   (1)
5 Central Subspace. A dimension reduction subspace (DRS) is a subspace span(B) ⊆ R^p for which (1) holds. The central subspace S_{Y|X} is the intersection of all DRSs; it has the minimum dimension among all DRSs and exists uniquely under very mild conditions (Cook, 1998, Prop. 6.4). We assume S_{Y|X} = span(B). The dimension d of S_{Y|X} is called the structural dimension.
6 Estimation of S_{Y|X}. Seminal paper of the early 1990s: K.-C. Li (1991), "Sliced Inverse Regression for Dimension Reduction" (with discussion), JASA, 86. Many other methods followed: Sliced Average Variance Estimation (SAVE, 1991), Principal Hessian Directions (pHd, 1992), Contour Regression (2005), Fourier-Transformation-Based Estimation (2005), Directional Regression (2007), Cumulative Sliced Regression (CUME, 2008), and many others.
7 Sliced Inverse Regression. Foundation of SIR: under the linearity condition,

    E(Z | Y) ∈ S_{Y|Z} = Σ^{1/2} S_{Y|X}, where Z = Σ^{-1/2}{X - E(X)}.

Slice the response into H non-overlapping intervals I_1, ..., I_H and compute

    m̂_h := E_n(Z | Y ∈ I_h) = (1/n_h) Σ_{Y_i ∈ I_h} z_i,  h = 1, ..., H.

B is estimated by premultiplying the first d leading eigenvectors of Σ_{h=1}^H m̂_h m̂_h' by Σ̂^{-1/2}.
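The recipe above can be written out directly (a minimal NumPy sketch, not the authors' code; the slice boundaries and the weighting of the slice means are simplified choices):

```python
import numpy as np

def sir(X, y, H=10, d=2):
    """Sliced Inverse Regression: estimate d basis directions of S_{Y|X}.

    Standardize X, slice y into H intervals, average the standardized
    predictors within each slice, and eigen-decompose the weighted sum
    of outer products of the slice means.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)             # symmetric Sigma^{-1/2}
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = Xc @ Sigma_inv_half                        # standardized predictors
    edges = np.quantile(y, np.linspace(0, 1, H + 1))
    M = np.zeros((p, p))
    for h in range(H):
        upper = y <= edges[h + 1] if h == H - 1 else y < edges[h + 1]
        mask = (y >= edges[h]) & upper
        if mask.any():
            m_h = Z[mask].mean(axis=0)             # slice mean of Z
            M += (mask.sum() / n) * np.outer(m_h, m_h)
    w, v = np.linalg.eigh(M)                       # leading eigenvectors of M
    return Sigma_inv_half @ v[:, ::-1][:, :d]      # map back to the X scale
```

For a response such as Y = X_1 + noise, the leading estimated direction aligns closely with e_1.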
8 SIR with Binary Response. If Y ∈ {-1, 1} is binary, there is only one possible way to slice: I_1 = {i : y_i = 1} and I_2 = {i : y_i = -1}. The associated slice means z̄_1 and z̄_2 are linearly dependent since z̄_n = 0, so SIR can estimate at most ONE direction.
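The rank deficiency is easy to verify numerically (a toy check, not from the slides): with centered standardized predictors, n_1 z̄_1 + n_2 z̄_2 = 0, so the SIR candidate matrix is exactly rank one.

```python
import numpy as np

# Two slice means of centered predictors are collinear when y is binary,
# so the SIR candidate matrix sum_h (n_h/n) m_h m_h' has rank 1.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 4))
Z = Z - Z.mean(axis=0)                 # centered: columns sum to zero
y = rng.choice([-1, 1], size=200)
m1 = Z[y == 1].mean(axis=0)
m2 = Z[y == -1].mean(axis=0)
M = (y == 1).mean() * np.outer(m1, m1) + (y == -1).mean() * np.outer(m2, m2)
rank = np.linalg.matrix_rank(M, tol=1e-10)
```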
9 Illustration: Wisconsin Diagnostic Breast Cancer Data. Figure: (a) SIR, Y vs. b̂_1'X; (b) SAVE, b̂_1'X vs. b̂_2'X.
10 Outline
11 Principal Support Vector Machine. For (Y, X) ∈ R × R^p, the PSVM (Li et al., 2011, AOS) solves the SVM-like problem

    (a_{0,c}, b_{0,c}) = argmin_{a,b} b'Σb + λ E[1 - Ỹ_c(a + b'(X - E X))]_+,

where b'Σb = Var(b'X) and the hinge acts on f(X) = a + b'(X - E X):
- Ỹ_c = 1{Y ≥ c} - 1{Y < c} for a given constant c,
- Σ = cov(X),
- [u]_+ = max(0, u).

Foundation of the PSVM: under the linearity condition, b_{0,c} ∈ S_{Y|X} for any given c.
12 PSVM: Sample Estimation. Given data (X_i, Y_i), i = 1, ..., n:
1. For a given grid min Y_i < c_1 < ... < c_H < max Y_i, solve a sequence of PSVMs for different values of c_h:

    (â_{n,h}, b̂_{n,h}) = argmin_{a,b} b'Σ̂_n b + (λ/n) Σ_{i=1}^n [1 - Ỹ_{i,c_h}(a + b'(X_i - X̄_n))]_+.

2. The first k leading eigenvectors of M̂_n^L = Σ_{h=1}^H b̂_{n,h} b̂_{n,h}' estimate a basis set of S_{Y|X}.
13 PSVM: Remarks.
Pros: outperforms SIR; extends to the kernel PSVM to handle nonlinear SDR.
Cons: estimates only one direction if Y is binary.
14 Weighted PSVM. Toward SDR with binary Y, the WPSVM minimizes

    Λ_π(θ) = β'Σβ + λ E[π(Y){1 - Y(α + β'(X - E X))}_+],

where
- θ = (α, β')' ∈ R × R^p,
- π(y) = 1 - π if y = 1 and π otherwise, for a given π ∈ (0, 1),
- Y itself is binary (no need for Ỹ_c).

Let θ_{0,π} = (α_{0,π}, β_{0,π}')' = argmin_θ Λ_π(θ).

Foundation of the weighted PSVM: under the linearity condition, β_{0,π} ∈ S_{Y|X} for any given π ∈ (0, 1).
15 WPSVM: Sample Estimation. Given (X_i, Y_i) ∈ R^p × {1, -1}, i = 1, ..., n:
1. For a given grid 0 < π_1 < ... < π_H < 1, solve a sequence of WPSVMs,

    Λ̂_{n,π_h}(θ) = β'Σ̂_n β + (λ/n) Σ_{i=1}^n π_h(Y_i)[1 - Y_i(α + β'(X_i - X̄_n))]_+,

and let θ̂_{n,h} = (α̂_{n,h}, β̂_{n,h}')' = argmin_θ Λ̂_{n,π_h}(θ).
2. The first k leading eigenvectors of the WPSVM candidate matrix M̂_n^{WL} = Σ_{h=1}^H β̂_{n,h} β̂_{n,h}' estimate a basis set of S_{Y|X}.
16 Computation. Let η = Σ̂_n^{1/2} β and U_i = Σ̂_n^{-1/2}(X_i - X̄_n). The WPSVM objective Λ̂_{n,π_h}(θ) becomes

    η'η + (λ/n) Σ_{i=1}^n π_h(Y_i)[1 - Y_i(α + η'U_i)]_+,

which is exactly the linear weighted SVM (WSVM) objective for the data (U_i, Y_i). Hence it suffices to solve the WSVM H times, once for each weight π_h, h = 1, ..., H.
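This reduction can be sketched end to end (plain subgradient descent stands in for a proper WSVM solver; `lam`, the step sizes, and the iteration count are illustrative choices, not the authors'):

```python
import numpy as np

def wpsvm_directions(X, y, pis, lam=1.0, k=1, n_iter=2000, lr=0.05):
    """Linear WPSVM directions for binary y in {-1, +1}.

    For each pi: whiten X to U_i = Sigma^{-1/2}(X_i - Xbar), minimize the
    WSVM objective eta'eta + (lam/n) * sum_i pi(y_i) * hinge_i over
    (alpha, eta) by subgradient descent, map eta back to
    beta = Sigma^{-1/2} eta, and finally eigen-decompose
    sum_h beta_h beta_h'.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
    U = Xc @ Sigma_inv_half                      # whitened predictors
    M = np.zeros((p, p))
    for pi in pis:
        w = np.where(y == 1, 1.0 - pi, pi)       # observation weights pi(y)
        alpha, eta = 0.0, np.zeros(p)
        for t in range(n_iter):
            margin = y * (alpha + U @ eta)
            act = w * y * (margin < 1)           # hinge subgradient mask
            g_eta = 2 * eta - (lam / n) * (U.T @ act)
            g_alpha = -(lam / n) * act.sum()
            step = lr / np.sqrt(t + 1.0)         # decaying step size
            eta -= step * g_eta
            alpha -= step * g_alpha
        beta = Sigma_inv_half @ eta              # back to the X scale
        M += np.outer(beta, beta)
    _, V = np.linalg.eigh(M)
    return V[:, ::-1][:, :k]                     # leading eigenvectors
```

For Y = sign(X_1 + noise), the recovered leading direction aligns with e_1 across the π grid.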
17 π-path. Wang et al. (2008, Biometrika) show that the WSVM solution moves piecewise-linearly as a function of π. Shin et al. (2013, JCGS) implemented the π-path algorithm in R while developing a two-dimensional solution surface for weighted SVMs. Figure: solution paths of the linear WPSVM, β̂_n as a function of π.
18 Asymptotic Results (1). The standard approach based on the M-estimation scheme, similar to results for the linear SVM:
- Koo et al. (2008, JMLR)
- Jiang et al. (2008, JMLR)

Consistency of θ̂_n: if Σ is positive definite, then θ̂_n → θ_0 in probability.
19 Asymptotic Results (2). Asymptotic normality of θ̂_n (a Bahadur representation): under regularity conditions ensuring the existence of the gradient vector D_θ and Hessian matrix H_θ of Λ_π(θ),

    √n (θ̂_n - θ_0) = n^{-1/2} H_{θ_0}^{-1} Σ_{i=1}^n D_{θ_0}(Z_i) + o_p(1),

where

    D_θ(Z) = (0, (2Σβ)')' - λ π(Y) X̃ Y 1{θ'X̃ Y < 1},

    H_θ = 2 diag(0, Σ) + λ Σ_{y=-1,1} P(Y = y) π(y) f_{β'X|Y}(y - α | y) E(X̃X̃' | θ'X̃ = y),

with X̃ = (1, X')'.
20 Asymptotic Results (3). For a given grid π_1 < ... < π_H, define the population WPSVM candidate matrix

    M_0^{WL} = Σ_{h=1}^H β_{0,h} β_{0,h}'.

Asymptotic normality of M̂_n^{WL}: suppose rank(M_0^{WL}) = k. Under the regularity conditions,

    √n {vec(M̂_n^{WL}) - vec(M_0^{WL})} → N(0, Σ_M),

where Σ_M is given explicitly. Asymptotic normality of the eigenvectors of M̂_n^{WL} then follows from the normality of M̂_n^{WL} (Bura & Pfeiffer, 2008).
21 Structural Dimension Selection. We estimate k as

    k̂ = argmax_{k ∈ {1,...,p}} [ Σ_{j=1}^k υ_j - ρ k (log n / n) υ_1 ],

where υ_1 ≥ ... ≥ υ_p are the eigenvalues of M̂_n. Then lim_{n→∞} P(k̂ = k) = 1.
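The criterion transcribes directly into code (a sketch; `rho` is the tuning constant whose selection is described at the end of the slides):

```python
import numpy as np

def select_k(M_hat, n, rho=0.5):
    """BIC-type structural dimension estimate from the candidate matrix.

    Maximizes sum_{j<=k} v_j - rho * k * (log n / n) * v_1 over k,
    where v_1 >= ... >= v_p are the eigenvalues of M_hat.
    """
    v = np.sort(np.linalg.eigvalsh(M_hat))[::-1]   # descending eigenvalues
    crit = [v[:k].sum() - rho * k * (np.log(n) / n) * v[0]
            for k in range(1, len(v) + 1)]
    return int(np.argmax(crit)) + 1
```

With two dominant eigenvalues the penalty term stops the sum after k = 2, mirroring the consistency statement above.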
22 Outline
23 Nonlinear SDR. Nonlinear SDR assumes Y ⊥ X | φ(X), where φ : R^p → R^k is an arbitrary function of X living in H, a Hilbert space of functions of X. SDR is achieved by estimating φ.
24 Kernel WPSVM: Objective Function. The kernel WPSVM objective function is

    Λ_π(α, ψ) = var(ψ(X)) + λ E{π(Y)[1 - Y(α + ψ(X) - E ψ(X))]_+},

and the kernel WPSVM solves

    (α_{0,π}, ψ_{0,π}) = argmin_{α ∈ R, ψ ∈ H} Λ_π(α, ψ).
25 Kernel WPSVM: Foundation. For a given π, ψ_{0,π} has a version that is σ{φ(X)}-measurable; roughly speaking, ψ_{0,π} is a function of φ. This is the nonlinear generalization of the linear SDR statement: β_{0,π} ∈ S_{Y|X} = span(B) means that β_{0,π}'X is a linear function of B'X.
26 Kernel WPSVM: Sample Estimation. Work in a reproducing kernel Hilbert space. Using the linear operator Σ defined by ⟨ψ_1, Σψ_2⟩_H = cov{ψ_1(X), ψ_2(X)},

    Λ_π(α, ψ) = ⟨ψ, Σψ⟩_H + λ E{π(Y)[1 - Y(α + ψ(X) - E ψ(X))]_+}.

Li et al. (2011) proposed using as a basis set the first d leading eigenfunctions of the sample operator Σ_n : H → H defined by ⟨ψ_1, Σ_n ψ_2⟩_H = cov_n(ψ_1(X), ψ_2(X)). By Proposition 2 in Li et al. (2011), the eigenfunctions ω_j(X), j = 1, ..., d, can be readily obtained from the eigen-decomposition of (I_n - J_n) K_n (I_n - J_n). We chose d = n/4.
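The eigen-decomposition step can be sketched as follows (any Gram matrix K_n works; the RBF kernel in the usage below is an arbitrary choice, not prescribed by the slides):

```python
import numpy as np

def centered_kernel_basis(K, d):
    """First d eigenfunctions of Sigma_n, evaluated at the sample points.

    Following Li et al. (2011, Prop. 2), these are the leading
    eigenvectors of the doubly centered Gram matrix
    (I_n - J_n) K (I_n - J_n), with J_n = (1/n) 1 1'.
    """
    n = K.shape[0]
    C = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix I_n - J_n
    Kc = C @ K @ C
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:d]         # d largest eigenvalues
    return vecs[:, order], vals[order]
```

The returned eigenvectors are orthonormal and ordered by decreasing eigenvalue, so they serve directly as the columns of the basis matrix used on the next slide.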
27 Kernel WPSVM: Sample Estimation. The sample version of Λ_π(α, ψ) is

    Λ̂_{n,π}(α, γ) = γ'Ω'Ωγ + (λ/n) Σ_{i=1}^n π(Y_i)[1 - Y_i(α + γ'Ω_i)]_+,

where ω_1, ..., ω_d are the first d leading eigenfunctions of the operator Σ_n, Ω is the n × d matrix with (i, j)th entry ω*_j(X_i), ω*_j(X) = ω_j(X) - (1/n) Σ_{i=1}^n ω_j(X_i), and Ω_i denotes the ith row of Ω.
28 Kernel WPSVM: Dual Problem. Dual formulation:

    ν̂ = argmax_{ν_1,...,ν_n} Σ_{i=1}^n ν_i - (1/4) Σ_{i=1}^n Σ_{j=1}^n ν_i ν_j Y_i Y_j P_Ω^{(i,j)}

subject to
(i) 0 ≤ ν_i ≤ λ π(Y_i), i = 1, ..., n,
(ii) Σ_{i=1}^n ν_i Y_i = 0,

where P_Ω^{(i,j)} is the (i, j)th element of P_Ω = Ω(Ω'Ω)^{-1}Ω'. The kernel WPSVM solution is given by

    γ̂_n = (1/2) Σ_{i=1}^n ν̂_i Y_i (Ω'Ω)^{-1} Ω_i.
29 Kernel WPSVM: Algorithm.
1. For a given grid π_1 < ... < π_H, compute a sequence of kernel WPSVM solutions (α̂_{n,h}, γ̂_{n,h}) = argmin_{α,γ} Λ̂_{n,π_h}(α, γ).
2. The corresponding candidate matrix is

    Σ_{h=1}^H γ̂_{n,h} γ̂_{n,h}'.   (2)

3. Let V̂_n = (v̂_1, ..., v̂_k) denote the first k leading eigenvectors of (2); then φ̂(x) = V̂_n'(ω*_1(x), ..., ω*_d(x))'.
30 Outline
31 Simulation - Set Up. X_i = (X_{i1}, ..., X_{ip})' ~ N_p(0, I), i = 1, ..., n, where (n, p) = (500, 10). We consider 5 models:
- Model I: Y = sign{X_1/[0.5(X_2 + 1)^2] + 0.2ε}.
- Model II: Y = sign{(X_1 + 0.5)(X_2 + 0.5)^2 + 0.2ε}.
- Model III: Y = sign{sin(X_1)/e^{X_2} + 0.2ε}.
- Model IV: Y = sign{X_1(X_1 + X_2 + 1) + 0.2ε}.
- Model V: Y = sign{(X_1^2 + X_2^2)^{1/2} log[(X_1^2 + X_2^2)^{1/2}] + 0.2ε}.

B = (e_1, e_2), where e_i'X = X_i, i = 1, 2 (so k = 2). Performance is measured by ||P_B̂ - P_B||_F, where P_A = A(A'A)^{-1}A' and ||·||_F denotes the Frobenius norm.
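The performance measure is straightforward to code (a small sketch):

```python
import numpy as np

def proj_dist(B_hat, B):
    """||P_Bhat - P_B||_F with P_A = A (A'A)^{-1} A'.

    Both arguments are p x d matrices whose columns span the compared
    subspaces; the distance is 0 iff the column spaces coincide and is
    invariant to the choice of basis within each subspace.
    """
    def proj(A):
        return A @ np.linalg.inv(A.T @ A) @ A.T
    return float(np.linalg.norm(proj(B_hat) - proj(B), ord="fro"))
```

For two orthogonal one-dimensional subspaces the distance is √2, the maximum for d = 1.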
32 True Classification Function. Figure: surface plots of Models IV and V.
33 Results - Linear WPSVM. Table: averaged F-distance measures (with standard deviations in parentheses) over 100 independent repetitions for SAVE, pHd, Fourier, IHT, and the linear WPSVM under Models I-V.
34 Results - Structural Dimension. Table: empirical probabilities (in percentage) of correctly estimating the true k over 100 independent repetitions, comparing SAVE (permutation test; Cook and Yin, 2001) and the WPSVM for p = 10 and p = 20.
35 Results - Kernel WPSVM. Figure: nonlinear SDR results for a random data set from Model V: (a) original predictors (x_1 vs. x_2), (b) SAVE (b̂_1'X vs. b̂_2'X), (c) linear WPSVM (b̂_1'X vs. b̂_2'X).
36 Results - Kernel WPSVM. Figure: kernel WPSVM results for a random data set from Model V: (a) φ̂_1(X) vs. φ̂_2(X), (b) φ̂_1(X) vs. the true (X_1^2 + X_2^2)^{1/2}.
37 Results - Kernel WPSVM. Two-sample Hotelling's T^2 statistic:

    T_n^2 = (X̄_1 - X̄_2)' {Σ̂_n (1/n_1 + 1/n_2)}^{-1} (X̄_1 - X̄_2).

Table: averaged T_n^2 computed from the first two estimated sufficient predictors over 100 independent repetitions (standard deviations in parentheses) for Models IV and V, comparing SAVE, pHd, FCN, IHT, LWPSVM, and KWPSVM.
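A pooled-covariance version of the statistic can be sketched as follows (the slides' exact scaling was garbled in transcription, so the standard two-sample form is used here):

```python
import numpy as np

def hotelling_T2(X1, X2):
    """Two-sample Hotelling's T^2 with the pooled covariance estimate."""
    n1, n2 = len(X1), len(X2)
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled covariance of the two groups
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    # T^2 = diff' {S (1/n1 + 1/n2)}^{-1} diff
    return float(diff @ np.linalg.solve(S * (1.0 / n1 + 1.0 / n2), diff))
```

Larger values indicate better class separation in the estimated sufficient predictors, which is how the table compares the methods.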
38 WDBC Data - SDR Results. Figure: (a) k-selection, G_n(M) vs. structure dimension; (b) linear WPSVM, scatter of b̂_1'X, b̂_2'X, b̂_3'X; (c) kernel WPSVM, φ̂_1(X) vs. φ̂_2(X).
39 WDBC Data - Classification Accuracy. 3-NN test error rate for the raw data: 7.7% (1.2). Table: averaged test error rates (in percentage, with standard deviations in parentheses) of the kNN classifier (κ = 3) over 100 random partitions of the WDBC data with respect to the first k sufficient predictors (k = 1, ..., 5) estimated by SAVE, pHd, FCN, IHT, LWPSVM, and KWPSVM.
40 Outline
41 Conclusion. Most existing SDR methods suffer when Y is binary. The proposed WPSVM preserves all the merits of the PSVM and performs very well in binary classification. Computational efficiency can be improved further by employing the π-path algorithm.
42 Thank you!!!
43 Selected References
- Li, K.-C. (1991), "Sliced Inverse Regression for Dimension Reduction" (with discussion), Journal of the American Statistical Association, 86.
- Li, B., Artemiou, A., and Li, L. (2011), "Principal Support Vector Machines for Linear and Nonlinear Sufficient Dimension Reduction," Annals of Statistics, 39.
- Wang, J., Shen, X., and Liu, Y. (2008), "Probability estimation for large-margin classifiers," Biometrika, 95.
- Shin, S. J., Wu, Y., and Zhang, H. H. (2013), "Two-dimensional solution surface of the weighted support vector machines," Journal of Computational and Graphical Statistics, to appear.
- Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008), "A Bahadur Representation of the Linear Support Vector Machine," Journal of Machine Learning Research, 9.
- Jiang, B., Zhang, X., and Cai, T. (2008), "Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers," Journal of Machine Learning Research, 9.
- Bura, E. and Pfeiffer, C. (2008), "On the distribution of the left singular vectors of a random matrix and its applications," Statistics and Probability Letters, 78.
44 Linearity Condition. For any a ∈ R^p, E(a'X | B'X) is a linear function of B'X; equivalently,

    E(X | B'X) = P_Σ(B) X = B(B'ΣB)^{-1}B'Σ X.

- A common and essential assumption in SDR.
- Hard to check directly since B is unknown.
- Holds if X is elliptically symmetric (e.g., multivariate normal).
- Holds approximately as p grows with d fixed (Hall and Li, 1993).
- The assumption concerns only the marginal distribution of X.
45 ρ Selection
1. Randomly split the data into training and test sets.
2. Apply the WPSVM to the training set and compute its candidate matrix M̃_n^{tr}.
3. For a given ρ:
   3.a Compute k̂^{tr} = argmax_{k ∈ {1,...,p}} G_n(k; ρ, M̃_n^{tr}).
   3.b Transform the training predictors, X̃_j^{tr} = (Ṽ_n^{tr})'X_j^{tr}, where Ṽ_n^{tr} = (ṽ_1^{tr}, ..., ṽ_{k̂^{tr}}^{tr}) are the first k̂^{tr} leading eigenvectors of M̃_n^{tr}.
   3.c For each π_h, h = 1, ..., H, apply the WSVM to {(X̃_j^{tr}, Y_j^{tr}) : j = 1, ..., n_tr} to predict Y_j^{ts}.
   3.d Denoting the predicted labels Ŷ_j^{ts}, compute the total cost on the test set:

       TC(ρ) = Σ_{h=1}^H Σ_{j=1}^{n_ts} π_h(Y_j^{ts}) 1(Ŷ_j^{ts} ≠ Y_j^{ts}).

4. Repeat 3.a-3.d over a grid of ρ and select the ρ minimizing TC(ρ).
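Step 3.d can be sketched with a small helper (hypothetical, not the authors' code; the WSVM fits that produce the predictions are omitted):

```python
import numpy as np

def total_cost(y_true, preds_by_pi, pis):
    """TC(rho): pi-weighted misclassification cost summed over the pi grid.

    preds_by_pi[h] holds the test-set labels predicted by the WSVM
    fitted with weight pi_h; pi(y) = 1 - pi for y = +1 and pi for y = -1.
    """
    tc = 0.0
    for pi, y_hat in zip(pis, preds_by_pi):
        w = np.where(y_true == 1, 1.0 - pi, pi)    # pi(y) for each test case
        tc += float(np.sum(w * (y_hat != y_true)))
    return tc
```

The ρ value minimizing this cost over the grid is the one selected in step 4.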
Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationc 4, < y 2, 1 0, otherwise,
Fundamentals of Big Data Analytics Univ.-Prof. Dr. rer. nat. Rudolf Mathar Problem. Probability theory: The outcome of an experiment is described by three events A, B and C. The probabilities Pr(A) =,
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationNotes on the framework of Ando and Zhang (2005) 1 Beyond learning good functions: learning good spaces
Notes on the framework of Ando and Zhang (2005 Karl Stratos 1 Beyond learning good functions: learning good spaces 1.1 A single binary classification problem Let X denote the problem domain. Suppose we
More informationConvex Optimization and Modeling
Convex Optimization and Modeling Introduction and a quick repetition of analysis/linear algebra First lecture, 12.04.2010 Jun.-Prof. Matthias Hein Organization of the lecture Advanced course, 2+2 hours,
More informationKernels A Machine Learning Overview
Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter
More informationChapter 3: Maximum Likelihood Theory
Chapter 3: Maximum Likelihood Theory Florian Pelgrin HEC September-December, 2010 Florian Pelgrin (HEC) Maximum Likelihood Theory September-December, 2010 1 / 40 1 Introduction Example 2 Maximum likelihood
More informationDimension reduction techniques for classification Milan, May 2003
Dimension reduction techniques for classification Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Dimension reduction techniquesfor classification p.1/53
More informationThis paper is not to be removed from the Examination Halls
~~ST104B ZA d0 This paper is not to be removed from the Examination Halls UNIVERSITY OF LONDON ST104B ZB BSc degrees and Diplomas for Graduates in Economics, Management, Finance and the Social Sciences,
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More information10-725/36-725: Convex Optimization Prerequisite Topics
10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the
More informationA Bias Correction for the Minimum Error Rate in Cross-validation
A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationMachine Learning Practice Page 2 of 2 10/28/13
Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes
More informationRandom Forests. These notes rely heavily on Biau and Scornet (2016) as well as the other references at the end of the notes.
Random Forests One of the best known classifiers is the random forest. It is very simple and effective but there is still a large gap between theory and practice. Basically, a random forest is an average
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationLINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input
More informationKernel-based Feature Extraction under Maximum Margin Criterion
Kernel-based Feature Extraction under Maximum Margin Criterion Jiangping Wang, Jieyan Fan, Huanghuang Li, and Dapeng Wu 1 Department of Electrical and Computer Engineering, University of Florida, Gainesville,
More informationCross-Validation with Confidence
Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University UMN Statistics Seminar, Mar 30, 2017 Overview Parameter est. Model selection Point est. MLE, M-est.,... Cross-validation
More informationSufficient Dimension Reduction for Longitudinally Measured Predictors
Sufficient Dimension Reduction for Longitudinally Measured Predictors Ruth Pfeiffer National Cancer Institute, NIH, HHS joint work with Efstathia Bura and Wei Wang TU Wien and GWU University JSM Vancouver
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationML (cont.): SUPPORT VECTOR MACHINES
ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More information