Distribution-free inference for estimation and classification


Distribution-free inference for estimation and classification
Rina Foygel Barber (joint work with Fan Yang)
http://www.stat.uchicago.edu/~rina/

Inference without assumptions? Training data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Fit a parametric model, e.g. $Y \approx \beta_1 X^{(1)} + \cdots + \beta_d X^{(d)}$, and then build confidence intervals for $\mathbb{E}[Y \mid X = x]$ or prediction intervals for $Y \mid X = x$. Assume an incorrect model $\Rightarrow$ underestimate the uncertainty. Goal: distribution-free inference, without assuming the model is correct.

Prediction via conformal inference. Inspiration: Distribution-Free Predictive Inference for Regression, Lei, G'Sell, Rinaldo, Tibshirani, Wasserman (2016). Exchangeable data: $(X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, ???)$. Prediction interval for $Y_{n+1}$ under no assumptions, using any regression method. Coverage holds under exchangeability of the training data and the test point.
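For reference, here is a minimal sketch of one split-conformal variant in the spirit of Lei et al.: fit any regression method on half of the data, use the other half's absolute residuals to choose a margin, and report fitted value ± margin. The helper name and the choice of LinearRegression are illustrative assumptions, not part of the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def split_conformal_interval(X, Y, x_new, alpha=0.1, fit=LinearRegression):
    """Split-conformal prediction interval for Y at x_new (sketch).

    Any regression method can replace `fit`; marginal coverage >= 1 - alpha
    requires only exchangeability of the training data and the test point.
    X: (n, d) array, Y: length-n array, x_new: length-d array.
    """
    n = len(Y)
    idx = np.random.permutation(n)
    train, calib = idx[: n // 2], idx[n // 2:]
    model = fit().fit(X[train], Y[train])
    resid = np.abs(Y[calib] - model.predict(X[calib]))      # calibration residuals
    k = int(np.ceil((len(calib) + 1) * (1 - alpha)))        # conformal quantile index
    q = np.sort(resid)[min(k, len(calib)) - 1]
    pred = model.predict(np.atleast_2d(x_new))[0]
    return pred - q, pred + q
```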

Inference for estimation & classification. Write $\mathbb{E}[Y \mid X] = P(X)$, with fitted regression function $\widehat P(X)$. If $Y$ is binary, then a prediction interval is meaningless: it is always given by $\{0, 1\}$. More generally, if the noise is high, the prediction interval is very wide. Can we build a confidence interval for $P(X)$, with no assumptions?

Inference for estimation & classification. Our plan: (1) Given an estimate $\widehat P(x)$ and fresh data $(X_1, Y_1), \ldots, (X_n, Y_n)$, how can we build a confidence band for $P(x)$? (2) How can we use a single data set to both construct $\widehat P(x)$ and build the confidence band?

Confidence band for $P(x)$. Given an estimate $\widehat P(x)$ and fresh data $(X_1, Y_1), \ldots, (X_n, Y_n)$, how can we build a confidence band for $P(x)$? WLOG reorder the indices so that $\widehat P(X_1) \le \cdots \le \widehat P(X_n)$. [Figure: fitted probability, true probability, and smoothed true probability $\mathbb{P}\{Y = 1 \mid X\}$, plotted against the sorted index.]

Isotonic regression. (1) Calibration via isotonic regression: $(\hat p_1, \ldots, \hat p_n) = \arg\min\big\{\sum_i (Y_i - p_i)^2 : p_1 \le \cdots \le p_n\big\}$. (2) Best possible outcome: $(p^*_1, \ldots, p^*_n) = \arg\min\big\{\sum_i (P(X_i) - p_i)^2 : p_1 \le \cdots \le p_n\big\}$. (3) Build a confidence band for $(p^*_1, \ldots, p^*_n)$?
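The calibration step can be carried out with off-the-shelf isotonic regression (pool-adjacent-violators). Below is a minimal sketch, assuming numpy and scikit-learn; the function name and the clipping of the calibrated values to $[0, 1]$ (natural for binary $Y$) are illustrative choices, not part of the talk.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_isotonic(p_hat_scores, y):
    """Sort the fresh data by the fitted scores P_hat(X_i); isotonic regression of
    the Y_i's in that order gives the calibrated estimates p_hat_1 <= ... <= p_hat_n."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(p_hat_scores)
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True)
    # Solves the monotone least-squares problem over the sorted indices.
    p_calibrated = iso.fit_transform(np.arange(len(y)), y[order])
    return order, p_calibrated   # p_calibrated[i]: estimate at the i-th smallest score
```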

Isotonic regression. [Figure: fitted probability, calibrated estimate, true probability, and the isotonic projection of the true probability $\mathbb{P}\{Y = 1 \mid X\}$, plotted against the sorted index.]

Isotonic regression. Known results: if $Y_i = P(X_i) + \text{subgaussian noise}$, then $|\hat p_i - p^*_i| \sim n^{-1/3}$ or $n^{-1/2}$, with the exponent depending on local properties of the true $P(X_i)$'s. If $P(X_i)$ is locally strictly increasing, $n^{-1/3}$ rate; if $P(X_i)$ is locally constant, $n^{-1/2}$ rate (Chatterjee et al. 2015; Cator 2011; many others). Our goal: construct a data-adaptive bound on $|\hat p_i - p^*_i|$ that does not depend on knowing properties of the true means.

A geometric approach. Isotonic regression = projection: $\hat p = \mathcal{P}_{\mathrm{iso}}(Y)$, $p^* = \mathcal{P}_{\mathrm{iso}}(P(X))$, where $\mathcal{P}_{\mathrm{iso}}$ is the projection onto the isotonic cone $\{t : t_1 \le \cdots \le t_n\}$. Convex projection $\Rightarrow$ $\|\hat p - p^*\|_2^2 \le \|Y - P(X)\|_2^2$, and $\|Y - P(X)\|_2^2 \sim n$, so at best the confidence interval width is $\sim 1$.

A geometric approach. Can we use a different norm? Theorem (contraction via isotonic projection): For any norm $\|\cdot\|$, $\|\mathcal{P}_{\mathrm{iso}}(u) - \mathcal{P}_{\mathrm{iso}}(v)\| \le \|u - v\|$ for all $u, v$ if and only if $\|\cdot\|$ is nonincreasing under neighbor averaging: $\big\|\big(u_1, \ldots, u_{i-1}, \tfrac{u_i + u_{i+1}}{2}, \tfrac{u_i + u_{i+1}}{2}, u_{i+2}, \ldots, u_n\big)\big\| \le \|u\|$.

A geometric approach. Sliding window norm: $\|u\|_{\mathrm{SW}} = \max_{1 \le i \le j \le n} \sqrt{j - i + 1}\,\big|\bar u_{i:j}\big|$, where $\bar u_{i:j}$ is the average of $u_i, \ldots, u_j$. If $Y_i = P(X_i) + \text{subgaussian noise}$, then $\sqrt{j - i + 1}\,\overline{(Y - P(X))}_{i:j}$ is subgaussian $\Rightarrow$ $\|Y - P(X)\|_{\mathrm{SW}} \lesssim \sqrt{\log(n)}$. Since $\|\cdot\|_{\mathrm{SW}}$ is contractive by our theorem, $\|\hat p - p^*\|_{\mathrm{SW}} \lesssim \sqrt{\log(n)}$.
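The sliding window norm is easy to compute from cumulative sums. The sketch below (plain numpy, $O(n^2)$, hypothetical function name) evaluates $\|u\|_{\mathrm{SW}}$ exactly as defined above.

```python
import numpy as np

def sliding_window_norm(u):
    """||u||_SW = max over windows i <= j of sqrt(j - i + 1) * |mean(u[i..j])|.

    For subgaussian noise this norm scales like sqrt(log n), much smaller
    than the Euclidean norm's sqrt(n)."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    csum = np.concatenate([[0.0], np.cumsum(u)])
    best = 0.0
    for i in range(n):
        lengths = np.arange(1, n - i + 1)                # window lengths j - i + 1
        means = (csum[i + 1:] - csum[i]) / lengths       # window averages over [i, j]
        best = max(best, np.max(np.sqrt(lengths) * np.abs(means)))
    return best
```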

Data-adaptive bands. Know: $\|\hat p - p^*\|_{\mathrm{SW}} \lesssim \sqrt{\log(n)}$, and $\hat p, p^*$ are both monotonic. Hence $\hat p_{i:j} - \sqrt{\tfrac{\log(n)}{j - i + 1}} \lesssim p^*_i \lesssim \hat p_{i:j} + \sqrt{\tfrac{\log(n)}{j - i + 1}}$ for any window containing $i$; by taking a minimum over all $j$, we find a data-adaptive bound. If $p^*_i$ is locally strictly increasing, the minimum is achieved at $|j - i| \sim n^{2/3}$, with an $n^{-1/3}$ rate. If $p^*_i$ is locally constant, the minimum is achieved at $|j - i| \sim n$, with an $n^{-1/2}$ rate.
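Combining the sliding-window bound with monotonicity gives the data-adaptive band: for each index, scan windows ending (resp. starting) there and keep the tightest lower (resp. upper) bound. The sketch below is an illustration with an assumed constant in front of $\log(n)$; it is not the paper's calibrated construction.

```python
import numpy as np

def adaptive_band(p_hat, sigma=1.0, c=2.0):
    """Data-adaptive band sketch around the isotonic estimate p_hat.

    Lower bound at i: max over windows [j, i] of (window mean) - sigma*sqrt(c*log(n)/length).
    Upper bound at i: min over windows [i, j] of (window mean) + sigma*sqrt(c*log(n)/length).
    `c` is an illustrative constant, not the paper's choice."""
    p_hat = np.asarray(p_hat, dtype=float)
    n = len(p_hat)
    csum = np.concatenate([[0.0], np.cumsum(p_hat)])
    lo, hi = np.empty(n), np.empty(n)
    for i in range(n):
        lens_l = np.arange(1, i + 2)                              # windows ending at i
        means_l = (csum[i + 1] - csum[i + 1 - lens_l]) / lens_l
        lo[i] = np.max(means_l - sigma * np.sqrt(c * np.log(n) / lens_l))
        lens_r = np.arange(1, n - i + 1)                          # windows starting at i
        means_r = (csum[i + lens_r] - csum[i]) / lens_r
        hi[i] = np.min(means_r + sigma * np.sqrt(c * np.log(n) / lens_r))
    return lo, hi
```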

Data-adaptive bands. [Figure: simulated data $y_i$, the isotonic estimate $\mathrm{iso}(y)_i$, and the confidence band, plotted against the index $i$.]

Data-adaptive bands. [Figure: convergence of the mean confidence band width, plotting log(mean band width) against $\log(n/\log(n))$. Left panel, convergence in flat regions: least-squares slope of magnitude 0.5178, consistent with the $n^{-1/2}$ rate. Right panel, convergence in the increasing region: least-squares slope of magnitude 0.3259, consistent with the $n^{-1/3}$ rate.]

Data-adaptive bands. Summary of the new features of our method: a data-adaptive band, so we don't need to know properties of $P(X_i)$. We do need a bound $\sigma$ on the noise level: for binary data, use $\sigma = 1$; otherwise it can be estimated with $\hat\sigma^2 = \frac{\|Y - \hat p\|_2^2}{n - (\text{effective d.f.})}$ (Meyer & Woodroofe 2000). The confidence band contains the entire function $\Rightarrow$ too conservative at any single point.
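For non-binary data, the noise level can be estimated from the isotonic fit itself. A rough sketch, under the assumption (for illustration, in the spirit of Meyer & Woodroofe 2000) that the effective degrees of freedom is the number of constant pieces of the fit:

```python
import numpy as np

def estimate_sigma2(y, p_hat):
    """sigma^2 estimate: ||Y - p_hat||^2 / (n - effective d.f.).

    Assumption for this sketch: effective d.f. = number of constant pieces
    (distinct fitted values) of the isotonic fit p_hat."""
    y, p_hat = np.asarray(y, float), np.asarray(p_hat, float)
    n = len(y)
    eff_df = len(np.unique(np.round(p_hat, 12)))   # number of constant pieces
    return np.sum((y - p_hat) ** 2) / max(n - eff_df, 1)
```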

Reusing data (classification). Goal: using a single data set, estimate a regression function $\widehat P(x) \approx P(x) = \mathbb{E}[Y \mid X = x]$, and build a confidence band containing the true $P(x)$ (at most points $x$). This is a selective inference problem: choosing the model, and assessing its accuracy, using a single data set.

Reusing data (classification). How can we use a single data set to both construct $\widehat P(x)$ and build the confidence band? (1) Randomize the data by flipping some of the $Y_i$'s: $Y'_i = Y_i$ with probability $1 - \varphi$, and $Y'_i = 1 - Y_i$ with probability $\varphi$, where $\varphi$ is the flip probability.

Reusing data (classification). (2) Fit the regression function $\widehat P(x)$ using the blurred data $(X_1, Y'_1), \ldots, (X_n, Y'_n)$. Distribution of $Y' \mid X$: $\mathbb{P}\{Y'_i = 1 \mid X_i\} = P(X_i)(1 - \varphi) + (1 - P(X_i))\,\varphi$.

Reusing data (classification). (3) Then perform inference for $P(X)$ by revealing the original data $Y$. Distribution of $Y \mid X, Y'$: $\mathbb{P}\{Y_i = 1 \mid X_i, Y'_i = 1\} = \frac{P(X_i)(1 - \varphi)}{P(X_i)(1 - \varphi) + (1 - P(X_i))\,\varphi}$ and $\mathbb{P}\{Y_i = 1 \mid X_i, Y'_i = 0\} = \frac{P(X_i)\,\varphi}{P(X_i)\,\varphi + (1 - P(X_i))(1 - \varphi)}$.
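The flip-and-reveal scheme is easy to simulate. The sketch below (hypothetical helper names, numpy only) blurs the labels with flip probability $\varphi$ and evaluates the conditional distribution of the original labels given the blurred ones, matching the two formulas above.

```python
import numpy as np

def flip_labels(y, phi, rng=None):
    """Blur binary labels: each Y_i is flipped independently with probability phi."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y)
    flips = rng.random(len(y)) < phi
    return np.where(flips, 1 - y, y)

def reveal_posterior(p, phi, y_blurred):
    """P{Y_i = 1 | X_i, Y'_i} via Bayes' rule, given P(X_i) = p and flip probability phi."""
    p = np.asarray(p, dtype=float)
    post_if_1 = p * (1 - phi) / (p * (1 - phi) + (1 - p) * phi)   # given Y'_i = 1
    post_if_0 = p * phi / (p * phi + (1 - p) * (1 - phi))          # given Y'_i = 0
    return np.where(np.asarray(y_blurred) == 1, post_if_1, post_if_0)
```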

Summary. Inference for estimation: isotonic regression + fresh data $\Rightarrow$ distribution-free confidence bands. Is it possible to fit the model and perform inference on a single data set? (Distribution-free selective inference?)

Thank you! Website: http://www.stat.uchicago.edu/~rina/ Preprint: http://arxiv.org/abs/1706.01852 Thanks to funding from the NSF & a Sloan Fellowship.