Supplementary Material: Margin Based PU Learning

We give the complete proofs of Theorems 1 and 2 in this supplementary material. We first introduce a well-known matrix concentration inequality, with which the covariance estimator can be bounded, and then analyze the convergence of PMPU.

Matrix Concentration Inequalities

Lemma 1 (Matrix Bernstein's inequality). Consider a finite sequence $\{S_i\}$ of independent random matrices of dimension $d_1 \times d_2$. Assume that each matrix has uniformly bounded deviation from its mean: $\|S_i - \mathbb{E}S_i\| \le L$ for each index $i$. Introduce the random matrix $Z = \sum_i S_i$, and let $\nu(Z)$ be the matrix variance of $Z$, where
$$\nu(Z) = \max\Big\{\big\|\mathbb{E}(Z-\mathbb{E}Z)(Z-\mathbb{E}Z)^*\big\|,\ \big\|\mathbb{E}(Z-\mathbb{E}Z)^*(Z-\mathbb{E}Z)\big\|\Big\} = \max\Big\{\Big\|\sum_i \mathbb{E}(S_i-\mathbb{E}S_i)(S_i-\mathbb{E}S_i)^*\Big\|,\ \Big\|\sum_i \mathbb{E}(S_i-\mathbb{E}S_i)^*(S_i-\mathbb{E}S_i)\Big\|\Big\}.$$
Then
$$\mathbb{E}\|Z - \mathbb{E}Z\| \le \sqrt{2\,\nu(Z)\log(d_1+d_2)} + \tfrac{1}{3}L\log(d_1+d_2).$$
Furthermore, for all $t \ge 0$,
$$\mathbb{P}\big\{\|Z - \mathbb{E}Z\| \ge t\big\} \le (d_1+d_2)\exp\Big(\frac{-t^2/2}{\nu(Z) + Lt/3}\Big).$$

With the matrix Bernstein inequality, it is standard to obtain the concentration of the covariance estimator:

Proposition 1. Suppose $\{x_i\}_{i=1}^N \subset \mathbb{R}^d$ are independent and identically distributed (i.i.d.) sub-Gaussian random vectors and $X = [x_1, x_2, \dots, x_N]$. Then, provided $N \ge C_\delta\, d\log(d)/\epsilon^2$, with probability at least $1-\delta$,
$$\Big\|\frac{1}{N}XX^\top - I\Big\| \le \epsilon.$$
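Proposition 1 can be checked numerically. The following minimal Python sketch (an illustration under the simplest sub-Gaussian model, standard Gaussian columns; the dimension and sample sizes are arbitrary choices) estimates the spectral-norm deviation of the empirical covariance:

```python
import numpy as np

# Empirical illustration of Proposition 1: for i.i.d. standard Gaussian columns,
# the empirical covariance (1/N) X X^T concentrates around the identity once
# N is on the order of d log(d) / eps^2.
rng = np.random.default_rng(0)
d = 50
for N in (200, 2_000, 20_000):
    X = rng.standard_normal((d, N))           # columns x_1, ..., x_N
    cov = X @ X.T / N                         # empirical covariance
    err = np.linalg.norm(cov - np.eye(d), 2)  # spectral-norm deviation
    print(f"N = {N:6d}   ||XX^T/N - I|| = {err:.3f}")
# The deviation decreases as N grows, consistent with the d log(d)/eps^2 sample requirement.
```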

Lemma 2. Let $X = [x_1, x_2, \dots, x_N] \in \mathbb{R}^{d\times N}$, and suppose each $x_i$ is independently sampled from the truncated Gaussian distribution with positive margin $\eta$. Then, for $w \in \mathbb{R}^d$ with $\|w\| = 1$, we have
$$\mathbb{E}\big[\operatorname{sgn}(\langle x, w\rangle)\, x\big] = \lambda w,$$
where $\lambda > 0$ is a constant determined by the truncation (for a standard Gaussian vector, $\lambda = \sqrt{2/\pi}$).

Proof. It is well known that when $x$ is a standard Gaussian random vector, $\lambda = \sqrt{2/\pi}$. In our setting the first dimension of $x$ is a truncated Gaussian, hence $\mathbb{E}[\operatorname{sgn}(\langle x, w\rangle)\, x] = \mathbb{E}|\langle x, w\rangle|\, w$, and a direct computation shows that the truncation contributes an additional exponential term to $\sqrt{2/\pi}$.

Lemma 3. Let $g = [g_1, g_2, \dots, g_d]^\top$, where $g_1$ is a truncated Gaussian random variable and the remaining $d-1$ coordinates are i.i.d. standard Gaussian. For two different vectors $w, w' \in \mathbb{R}^d$ with $\alpha = \arccos\langle w, w'\rangle$ sufficiently small, we have
$$\big\|\mathbb{E}\big[g g^\top\big(\operatorname{sgn}\langle g, w\rangle - \operatorname{sgn}\langle g, w'\rangle\big)\big]\big\| \le C_1\,\|w - w'\|, \qquad \big|\mathbb{E}\big[g_1\big(\operatorname{sgn}\langle g, w\rangle - \operatorname{sgn}\langle g, w'\rangle\big)\big]\big| \le C_2\,\|w - w'\|,$$
where the constants $C_1, C_2$ depend on $d$ and on the truncation.

Proof. Define $\alpha = \arccos\langle w, w'\rangle$, and write $\langle g, w\rangle = g_1$ and $\langle g, w'\rangle = g_1\cos\alpha + g_2\sin\alpha$. We prove the two inequalities under the condition that $\alpha$ is small.

(a) Writing $\Delta := \operatorname{sgn}\langle g, w\rangle - \operatorname{sgn}\langle g, w'\rangle$, since
$$\mathbb{E}\big[gg^\top\Delta\big] = \mathbb{E}\begin{bmatrix} g_1^2 & g_1 g_2 & \cdots & g_1 g_d\\ g_2 g_1 & g_2^2 & \cdots & g_2 g_d\\ \vdots & & \ddots & \vdots\\ g_d g_1 & g_d g_2 & \cdots & g_d^2\end{bmatrix}\Delta,$$
we need to estimate each entry $\mathbb{E}[g_i g_j\Delta]$. Observe that $\Delta \neq 0$ only when $g_1 > 0$ and $g_1\cos\alpha + g_2\sin\alpha < 0$, or $g_1 < 0$ and $g_1\cos\alpha + g_2\sin\alpha > 0$; otherwise $\Delta = 0$. Hence the expectation is supported on the domain
$$\Omega = \big\{(g_1, g_2): g_1 > 0,\ g_1\cos\alpha + g_2\sin\alpha < 0\big\} \cup \big\{(g_1, g_2): g_1 < 0,\ g_1\cos\alpha + g_2\sin\alpha > 0\big\},$$
with all the other Gaussian variables $g_3, \dots, g_d$ unrestricted.

For $i = j = 1$,
$$\mathbb{E}\big[g_1^2|\Delta|\big] = 2\iint_{\Omega} g_1^2\,\phi_1(g_1)\phi(g_2)\,dg_1\,dg_2 \cdot \int \phi(g_3)\cdots\phi(g_d)\,dg_3\cdots dg_d \le c_1\alpha,$$
where $\phi_1$ denotes the density of the truncated first coordinate, $\phi$ the standard Gaussian density, and the factor coming from the truncation is absorbed into $c_1$.

For $i = j = 2$, the polar transformation gives
$$\mathbb{E}\big[g_2^2|\Delta|\big] = 2\iint_{\Omega} g_2^2\,\phi_1(g_1)\phi(g_2)\,dg_1\,dg_2 \le c\int_{\pi/2}^{\pi/2+\alpha}\int_0^\infty \sin^2\theta\, r^3 e^{-r^2/2}\,dr\,d\theta = c'\,(\alpha + \sin\alpha\cos\alpha),$$
and since $\sin\alpha\cos\alpha \le \sin\alpha \le \alpha$, we have $c'(\alpha + \sin\alpha\cos\alpha) \le 3c'\alpha$.

For $i = j \ge 3$, $g_i$ is independent of $(g_1, g_2)$, so $\mathbb{E}[g_i^2|\Delta|] = \mathbb{E}[g_i^2]\,\mathbb{E}|\Delta| \le c\alpha$, because $|\Delta| \le 2$ and the probability of $\Omega$ is proportional to $\alpha$.

For $i \in \{1, 2\}$ and $j \ge 3$ (or symmetrically $j \in \{1, 2\}$ and $i \ge 3$), $g_j$ is independent of $(g_1, g_2)$ and has mean zero, so $\mathbb{E}[g_i g_j\Delta] = \mathbb{E}[g_j]\,\mathbb{E}[g_i\Delta] = 0$; the same holds for all the other off-diagonal cases with $i \neq j$ and $i, j \ge 3$. The only remaining off-diagonal entries are $(i,j) = (1,2)$ and $(2,1)$, which are bounded over $\Omega$ exactly as above by $c\alpha$.
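Before assembling these entrywise bounds, the scaling in Lemma 3 can be probed numerically. The sketch below is a simplification that ignores the truncation of $g_1$ and uses a fully standard Gaussian $g$; it estimates $\|\mathbb{E}[gg^\top(\operatorname{sgn}\langle g,w\rangle - \operatorname{sgn}\langle g,w'\rangle)]\|$ for vectors at decreasing angles:

```python
import numpy as np

# Monte Carlo illustration of Lemma 3 (truncation of g_1 ignored for simplicity):
# the spectral norm of E[g g^T (sgn<g,w> - sgn<g,w'>)] decreases proportionally
# to ||w - w'|| as the angle between w and w' shrinks.
rng = np.random.default_rng(1)
d, n = 10, 400_000
G = rng.standard_normal((n, d))
w = np.zeros(d); w[0] = 1.0
for angle in (0.4, 0.2, 0.1, 0.05):
    w2 = np.zeros(d); w2[0], w2[1] = np.cos(angle), np.sin(angle)
    diff = np.sign(G @ w) - np.sign(G @ w2)     # nonzero only on the wedge Omega
    M = (G * diff[:, None]).T @ G / n           # estimate of E[g g^T (sgn - sgn)]
    print(f"||w - w'|| = {np.linalg.norm(w - w2):.3f}   "
          f"||E[g g^T diff]|| = {np.linalg.norm(M, 2):.4f}")
```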

Therefore, collecting the entrywise estimates,
$$\big\|\mathbb{E}[gg^\top\Delta]\big\| \le d\max_{i,j}\big|\mathbb{E}[g_i g_j\Delta]\big| \le C d\,\alpha \le C_1\,\|w - w'\|,$$
in which the first inequality holds because $\|A\| \le d\max_{i,j}|A_{ij}|$ for any $d\times d$ matrix $A$, and the last inequality holds because $\|w - w'\| = 2\sin(\alpha/2) \ge \tfrac{2}{\pi}\alpha$ for unit vectors.

(b) The proof is similar to that of (a). Restricting the expectation to $\Omega$ as before,
$$\big|\mathbb{E}[g_1\Delta]\big| = \Big|2\iint_{\Omega} g_1\,\phi_1(g_1)\phi(g_2)\,dg_1\,dg_2\Big| \le C\alpha \le C_2\,\|w - w'\|.$$

Proof of Theorem 1

Proof. By the rotation invariance of the Euclidean space, there exists a rotation matrix $Q$ such that $Qw^* = [1, 0, \dots, 0]^\top$, so without loss of generality we assume $w^* = [1, 0, \dots, 0]^\top \in \mathbb{R}^d$. For simplicity we discard the superscript $t$ on $X^t$, but the reader should be aware that the feature matrix $X$ is re-sampled at every iteration. Let $x = [x_1, x_\perp]$, where $x_1$ denotes the first coordinate of $x$ and $x_\perp$ the remaining $d-1$ coordinates; similarly write $w = [w_1, w_\perp]$. Denote by $\hat y^t = \operatorname{sgn}(X^\top w^t)$ the predicted labels, by $y^t = S(X^\top w^t)$ the observed labels at step $t$, and by $\Delta y^t = \hat y^t - y^t$ the label error. Since at the $t$-th iteration
$$w^{t+1} = w^t - \frac{1}{\lambda m_t}\,X\Delta y^t,$$
we have
$$w^{t+1} - w^* = w^t - w^* - \frac{1}{\lambda m_t}X\big(\operatorname{sgn}(X^\top w^t) - \operatorname{sgn}(X^\top w^*)\big) - \frac{1}{\lambda m_t}X\Delta^t,$$
where $\Delta^t := \operatorname{sgn}(X^\top w^*) - S(X^\top w^t)$ collects the labels flipped by the PU sampling. To bound the first two terms, using Lemma 2 and Lemma 3 we have, with probability at least $1-\delta$,
$$\Big\|w^t - w^* - \frac{1}{\lambda m_t}X\big(\operatorname{sgn}(X^\top w^t) - \operatorname{sgn}(X^\top w^*)\big)\Big\| \le \epsilon\,\max\big\{\|w^t - w^*\|,\ \|w^t - w^*\|^{1/2}\big\},$$
provided $m_t \ge O\big(d\log d/\epsilon^2\big)$, up to a factor depending on the truncation.
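For intuition, the recursion just written down can be simulated directly. The sketch below is schematic only: it uses standard Gaussian features, takes the observed labels to be the clean labels $\operatorname{sgn}(X^\top w^*)$, and fixes $\lambda = \sqrt{2/\pi}$, all of which are simplifying assumptions rather than the PU setting of the theorem.

```python
import numpy as np

# Schematic simulation of the analyzed recursion
#   w^{t+1} = w^t - (1 / (lambda m)) X (sgn(X^T w^t) - y^t),
# with re-sampled features at every step.  Clean labels and lambda = sqrt(2/pi)
# are simplifying assumptions, not the PU setting of Theorem 1.
rng = np.random.default_rng(2)
d, T, m = 20, 10, 5_000
lam = np.sqrt(2.0 / np.pi)
w_star = np.zeros(d); w_star[0] = 1.0            # ground truth after rotation
w = rng.standard_normal(d); w /= np.linalg.norm(w)

for t in range(T):
    X = rng.standard_normal((d, m))              # feature matrix re-sampled each step
    y_hat = np.sign(X.T @ w)                     # predicted labels sgn(X^T w^t)
    y = np.sign(X.T @ w_star)                    # clean stand-in for the observed labels
    w = w - X @ (y_hat - y) / (lam * m)          # the analyzed update
    print(f"iter {t:2d}:   ||w^t - w*|| = {np.linalg.norm(w - w_star):.4f}")
```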

As we assume $m$ is sufficiently large, it is easy to satisfy $\|w^t - w^*\| \le 1$, and then
$$\Big\|w^t - w^* - \frac{1}{\lambda m_t}X\big(\operatorname{sgn}(X^\top w^t) - \operatorname{sgn}(X^\top w^*)\big)\Big\| \le \epsilon.$$
Next let us consider $\Delta^t$. It is clear that, with probability at least $1-\delta$,
$$\Big\|\frac{1}{\lambda m_t}X\Delta^t\Big\| \le C_\delta\sqrt{\frac{d\log d}{m_t}} + \frac{1}{\lambda}\Big\|\mathbb{E}\big[\big(\operatorname{sgn}(x^\top w^*) - S(x^\top w^t)\big)x\big]\Big\|.$$
The estimation of the expectation term involves two cases, namely $E_+$ and $E_-$, where $E_+$ is the error on $\{x^\top w^* \le -\eta,\ z > 0\}$ and $E_-$ is the error on $\{x^\top w^* > \eta,\ z < 0\}$, with $z := x^\top w^t$. Denote the cumulative distribution function of the standard Gaussian by $\Phi$, and write $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}}\int_0^z e^{-x^2}dx$ for the error function and $\operatorname{erfc}(z) = 1 - \operatorname{erf}(z)$ for the complementary error function, so that $\Phi(z) = \frac{1}{2}\big(1 + \operatorname{erf}(z/\sqrt{2})\big)$. Conditioning on $x^\top w^* = x_1 = \alpha$, the orthogonal part $x_\perp^\top w_\perp^t$ is Gaussian with variance $\|w_\perp^t\|^2$, so we obtain
$$E_+ = \mathbb{P}\big\{x_1 \le -\eta,\ x^\top w^t > 0\big\} = \int_{-\infty}^{-\eta}\phi_1(\alpha)\,\Phi\Big(\frac{\alpha\,w_1^t}{\|w_\perp^t\|}\Big)\,d\alpha = \frac{1}{2}\int_{-\infty}^{-\eta}\phi_1(\alpha)\,\operatorname{erfc}\Big(\frac{-\alpha\,w_1^t}{\sqrt{2}\,\|w_\perp^t\|}\Big)\,d\alpha,$$
where $\phi_1$ is the density of the first coordinate; completing the square in the exponent shows that $E_+$ is exponentially small in $\eta^2/\|w_\perp^t\|^2$, up to a complementary error function factor. Similarly,
$$E_- = \mathbb{P}\big\{x_1 > \eta,\ x^\top w^t < 0\big\} = \int_{\eta}^{\infty}\phi_1(\beta)\,\Phi\Big(\frac{-\beta\,w_1^t}{\|w_\perp^t\|}\Big)\,d\beta = \frac{1}{2}\int_{\eta}^{\infty}\phi_1(\beta)\,\operatorname{erfc}\Big(\frac{\beta\,w_1^t}{\sqrt{2}\,\|w_\perp^t\|}\Big)\,d\beta,$$
which admits a bound of the same form.
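The integral representation of $E_+$ above can be checked against a direct simulation. The snippet below assumes, purely for illustration, a two-sided truncation $|x_1| \ge \eta$ and specific values of $\eta$, $w_1^t$, and $\|w_\perp^t\|$:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Check of E_+ = P(x^T w* <= -eta, x^T w^t > 0) under an assumed two-sided
# truncation |x_1| >= eta, comparing numerical integration of
#   int_{-inf}^{-eta} phi_1(a) Phi(a w1 / ||w_perp||) da
# with a Monte Carlo estimate.  All parameter values are illustrative.
rng = np.random.default_rng(3)
eta, w1, wperp = 0.5, 0.95, 0.3

Z = 2.0 * norm.sf(eta)                               # normalization of the truncation
integrand = lambda a: norm.pdf(a) / Z * norm.cdf(a * w1 / wperp)
E_plus_int, _ = quad(integrand, -np.inf, -eta)

n = 2_000_000
x1 = rng.standard_normal(n)
x1 = x1[np.abs(x1) >= eta]                           # rejection sampling of the truncated x_1
g = rng.standard_normal(x1.size)                     # stands in for x_perp^T w_perp / ||w_perp||
E_plus_mc = np.mean((x1 <= -eta) & (x1 * w1 + wperp * g > 0))

print(f"integral: {E_plus_int:.5f}   monte carlo: {E_plus_mc:.5f}")
```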

Then,
$$\Big\|\mathbb{E}\big[\big(\operatorname{sgn}(x^\top w^*) - S(x^\top w^t)\big)x\big]\Big\| \le E_+ + E_-,$$
and substituting the bounds on $E_+$ and $E_-$ obtained above and simplifying gives
$$\Big\|\mathbb{E}\big[\big(\operatorname{sgn}(x^\top w^*) - S(x^\top w^t)\big)x\big]\Big\| \le \hat c\,\Big[\exp\Big(-\frac{c\,\eta^2}{\|w_\perp^t\|^2}\Big) + \delta_m\Big]\,\|w_\perp^t - w_\perp^*\| \le \tilde c\,\exp(-c')\,\|w^t - w^*\|.$$
The constants $\hat c$ and $c$ depend on $\eta$ and many other factors; however, once the parameters are fixed they are constants and do not control the order of our bound. The quantity $\delta_m$ is small when $m$ is large, because then $w_1 \to 1$ and $\|w_\perp\| \to 0$; as we always assume $m$ is sufficiently large, $\delta_m < 1$ due to the exponential decay. In particular, the upper bound on the error at the $t$-th step is
$$\Big\|\mathbb{E}\big[\big(\operatorname{sgn}(x^\top w^*) - S(x^\top w^t)\big)x\big]\Big\| \le c\,\exp(-c')\,\|w^t - w^*\|.$$
Combining everything above, we have with probability at least $1-\delta$,
$$\|w^{t+1} - w^*\| \le \epsilon + C_\delta\sqrt{\frac{d\log d}{m_t}} + c\,\exp(-c')\,\|w^t - w^*\|.$$
As $m_t$ is sampled from the unlabeled dataset, it can be taken as large as we want; therefore, for $m_t$ sufficiently large, the above inequality simplifies to
$$\|w^{t+1} - w^*\| \le c\,\exp(-c')\,\|w^t - w^*\|.$$

Proof of Theorem 2

Proof. Let $B_i = \lambda^{-1}\big[x_i\operatorname{sgn}\langle x_i, w\rangle - x_i\operatorname{sgn}\langle x_i, w'\rangle\big]$; then by Lemma 2 we have $\mathbb{E}B_i = w - w'$. Further, set
$$Z_i = B_i - \mathbb{E}B_i, \qquad B = \frac{1}{m}\sum_{i=1}^m B_i = \frac{1}{\lambda m}\big[X\operatorname{sgn}(X^\top w) - X\operatorname{sgn}(X^\top w')\big].$$
In order to utilize the matrix Bernstein inequality, we need to bound the terms $\max_i\|Z_i\|$, $\big\|\mathbb{E}Z_i^\top Z_i\big\|$ and $\big\|\mathbb{E}Z_i Z_i^\top\big\|$ respectively. For the first term, we have
$$\max_i\|Z_i\| = \max_i\|B_i - \mathbb{E}B_i\| \le \max_i\|B_i\| + \|\mathbb{E}B_i\| \le \max_i\frac{1}{\lambda}\Big\|x_i\big(\operatorname{sgn}\langle x_i, w\rangle - \operatorname{sgn}\langle x_i, w'\rangle\big)\Big\| + \|w - w'\| \le \frac{2\sqrt{d}}{\lambda} + \|w - w'\|.$$
When $\|w - w'\|$ is sufficiently small, this is at most $c\sqrt{d}/\lambda$.
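The quantity controlled by Theorem 2 can also be examined empirically. The following sketch (with standard Gaussian columns in place of the truncated distribution, and $\lambda = \sqrt{2/\pi}$, both illustrative simplifications) shows $B$ concentrating around $w - w'$ as $m$ grows:

```python
import numpy as np

# Monte Carlo illustration for Theorem 2: the average
#   B = (1/(lambda m)) [X sgn(X^T w) - X sgn(X^T w')]
# concentrates around w - w' as m grows, at a rate governed by matrix Bernstein.
# Standard Gaussian columns and lambda = sqrt(2/pi) are illustrative choices.
rng = np.random.default_rng(4)
d = 30
lam = np.sqrt(2.0 / np.pi)
w = rng.standard_normal(d);  w /= np.linalg.norm(w)
wp = w + 0.1 * rng.standard_normal(d);  wp /= np.linalg.norm(wp)
for m in (1_000, 10_000, 100_000):
    X = rng.standard_normal((d, m))
    B = (X @ np.sign(X.T @ w) - X @ np.sign(X.T @ wp)) / (lam * m)
    print(f"m = {m:6d}   ||B - (w - w')|| = {np.linalg.norm(B - (w - wp)):.4f}")
```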

For the second term, we get
$$\mathbb{E}Z_i^\top Z_i = \mathbb{E}(B_i - \mathbb{E}B_i)^\top(B_i - \mathbb{E}B_i) = \mathbb{E}\|B_i\|^2 - \|\mathbb{E}B_i\|^2 \le \mathbb{E}\|B_i\|^2 + \|\mathbb{E}B_i\|^2.$$
Since $\|\mathbb{E}B_i\|^2 = \|w - w'\|^2$ and, by Lemma 3,
$$\mathbb{E}\|B_i\|^2 = \frac{1}{\lambda^2}\,\mathbb{E}\Big[\|x_i\|^2\,\big|\operatorname{sgn}\langle x_i, w\rangle - \operatorname{sgn}\langle x_i, w'\rangle\big|^2\Big] \le \frac{Cd}{\lambda^2}\,\|w - w'\|,$$
we thus have
$$\mathbb{E}Z_i^\top Z_i \le \frac{Cd}{\lambda^2}\,\|w - w'\| + \|w - w'\|^2.$$
Note that if $\|w - w'\| < 1$ then $\|w - w'\| > \|w - w'\|^2$, while if $\|w - w'\| \ge 1$ then $\|w - w'\| \le \|w - w'\|^2$. Hence the above inequality can be rewritten as
$$\mathbb{E}Z_i^\top Z_i \le \frac{Cd}{\lambda^2}\,\max\big\{\|w - w'\|,\ \|w - w'\|^2\big\}.$$
For the third term, we have
$$\big\|\mathbb{E}Z_i Z_i^\top\big\| = \big\|\mathbb{E}(B_i - \mathbb{E}B_i)(B_i - \mathbb{E}B_i)^\top\big\| \le \big\|\mathbb{E}B_i B_i^\top\big\| + \big\|\mathbb{E}B_i(\mathbb{E}B_i)^\top\big\|.$$
Since $\|\mathbb{E}B_i(\mathbb{E}B_i)^\top\| = \|w - w'\|^2$ and, by Lemma 3,
$$\big\|\mathbb{E}B_i B_i^\top\big\| = \frac{1}{\lambda^2}\,\Big\|\mathbb{E}\Big[x_i x_i^\top\,\big|\operatorname{sgn}\langle x_i, w\rangle - \operatorname{sgn}\langle x_i, w'\rangle\big|^2\Big]\Big\| \le \frac{C}{\lambda^2}\,\|w - w'\|,$$
we derive
$$\big\|\mathbb{E}Z_i Z_i^\top\big\| \le \frac{C}{\lambda^2}\big(\|w - w'\| + \|w - w'\|^2\big),$$
which can be rewritten as
$$\big\|\mathbb{E}Z_i Z_i^\top\big\| \le \frac{C}{\lambda^2}\,\max\big\{\|w - w'\|,\ \|w - w'\|^2\big\}.$$
Now we can apply the matrix Bernstein inequality (Lemma 1) to obtain the final result.
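To see how the three bounds enter Lemma 1, the following small helper evaluates the Bernstein tail for the sum $\sum_i Z_i = m(B - \mathbb{E}B)$; the values of $C$, $\lambda$, and $\|w - w'\|$ used here are placeholders chosen only for illustration.

```python
import numpy as np

# Plugging the three bounds into the matrix Bernstein tail of Lemma 1 for
# Z = sum_i Z_i = m (B - E B): nu(Z) is m times the per-term variance bound and
# L bounds max_i ||Z_i||.  All numerical constants below are illustrative.
def bernstein_tail(t, nu, L, d1, d2):
    """Right-hand side of the tail bound in Lemma 1."""
    return (d1 + d2) * np.exp(-(t ** 2 / 2.0) / (nu + L * t / 3.0))

d, m = 30, 100_000
lam = np.sqrt(2.0 / np.pi)
C, delta_w = 1.0, 0.1                                   # placeholder constant and ||w - w'||
L = 2.0 * np.sqrt(d) / lam                              # bound on max_i ||Z_i||
nu = m * C * d / lam ** 2 * max(delta_w, delta_w ** 2)  # m times the variance bound
for t in (0.01, 0.05, 0.10):
    p = bernstein_tail(m * t, nu, L, d, 1)              # deviation t of the average B
    print(f"P(||B - (w - w')|| >= {t:.2f}) <= {p:.2e}")
```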