Distribution-free inference for estimation and classification


Distribution-free inference for estimation and classification
Rina Foygel Barber (joint work with Fan Yang)
http://www.stat.uchicago.edu/~rina/

Inference without assumptions? Training data $(X_1, Y_1), \ldots, (X_n, Y_n)$. Fit a parametric model, e.g. $Y \approx \beta_1 X^{(1)} + \cdots + \beta_d X^{(d)}$, and then build confidence intervals for $\mathbb{E}[Y \mid X = x]$ or prediction intervals for $Y \mid X = x$. Assume an incorrect model $\Rightarrow$ underestimate the uncertainty. Goal: distribution-free inference, without assuming the model is correct.

Prediction via conformal inference. Inspiration: Distribution-Free Predictive Inference for Regression, Lei, G'Sell, Rinaldo, Tibshirani, Wasserman (2016). Exchangeable data: $(X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, ???)$. Prediction interval for $Y_{n+1}$ under no assumptions, using any regression method. Coverage holds under exchangeability of the training data and the test point.
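For reference, here is a minimal sketch of one split-conformal variant in the spirit of Lei et al.: fit any regression method on half of the data, use the other half's absolute residuals to choose a margin, and report fitted value ± margin. The helper name and the choice of LinearRegression are illustrative assumptions, not part of the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def split_conformal_interval(X, Y, x_new, alpha=0.1, fit=LinearRegression):
    """Split-conformal prediction interval for Y at x_new (sketch).

    Any regression method can replace `fit`; marginal coverage >= 1 - alpha
    requires only exchangeability of the training data and the test point.
    X: (n, d) array, Y: length-n array, x_new: length-d array.
    """
    n = len(Y)
    idx = np.random.permutation(n)
    train, calib = idx[: n // 2], idx[n // 2:]
    model = fit().fit(X[train], Y[train])
    resid = np.abs(Y[calib] - model.predict(X[calib]))      # calibration residuals
    k = int(np.ceil((len(calib) + 1) * (1 - alpha)))        # conformal quantile index
    q = np.sort(resid)[min(k, len(calib)) - 1]
    pred = model.predict(np.atleast_2d(x_new))[0]
    return pred - q, pred + q
```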

Inference for estimation & classification. Write $\mathbb{E}[Y \mid X] = P(X)$, with fitted regression function $\widehat P(X)$. If $Y$ is binary, then a prediction interval is meaningless: it is always given by $\{0, 1\}$. More generally, if the noise is high, the prediction interval is very wide. Can we build a confidence interval for $P(X)$, with no assumptions?

Inference for estimation & classification. Our plan: (1) Given an estimate $\widehat P(x)$ and fresh data $(X_1, Y_1), \ldots, (X_n, Y_n)$, how can we build a confidence band for $P(x)$? (2) How can we use a single data set to both construct $\widehat P(x)$ and build the confidence band?

Confidence band for $P(x)$. Given an estimate $\widehat P(x)$ and fresh data $(X_1, Y_1), \ldots, (X_n, Y_n)$, how can we build a confidence band for $P(x)$? WLOG reorder the indices so that $\widehat P(X_1) \le \cdots \le \widehat P(X_n)$. [Figure: fitted probability, true probability, and smoothed true probability $\mathbb{P}\{Y = 1 \mid X\}$, plotted against the sorted index.]

Isotonic regression. (1) Calibration via isotonic regression: $(\hat p_1, \ldots, \hat p_n) = \arg\min\big\{\sum_i (Y_i - p_i)^2 : p_1 \le \cdots \le p_n\big\}$. (2) Best possible outcome: $(p^*_1, \ldots, p^*_n) = \arg\min\big\{\sum_i (P(X_i) - p_i)^2 : p_1 \le \cdots \le p_n\big\}$. (3) Build a confidence band for $(p^*_1, \ldots, p^*_n)$?
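The calibration step can be carried out with off-the-shelf isotonic regression (pool-adjacent-violators). Below is a minimal sketch, assuming numpy and scikit-learn; the function name and the clipping of the calibrated values to $[0, 1]$ (natural for binary $Y$) are illustrative choices, not part of the talk.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_isotonic(p_hat_scores, y):
    """Sort the fresh data by the fitted scores P_hat(X_i); isotonic regression of
    the Y_i's in that order gives the calibrated estimates p_hat_1 <= ... <= p_hat_n."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(p_hat_scores)
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True)
    # Solves the monotone least-squares problem over the sorted indices.
    p_calibrated = iso.fit_transform(np.arange(len(y)), y[order])
    return order, p_calibrated   # p_calibrated[i]: estimate at the i-th smallest score
```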

Isotonic regression. [Figure: fitted probability, calibrated estimate, true probability, and the isotonic projection of the true probability $\mathbb{P}\{Y = 1 \mid X\}$, plotted against the sorted index.]

Isotonic regression. Known results: if $Y_i = P(X_i) + \text{subgaussian noise}$, then $|\hat p_i - p^*_i| \sim n^{-1/3}$ or $n^{-1/2}$, with the exponent depending on local properties of the true $P(X_i)$'s. If $P(X_i)$ is locally strictly increasing, $n^{-1/3}$ rate; if $P(X_i)$ is locally constant, $n^{-1/2}$ rate (Chatterjee et al. 2015; Cator 2011; many others). Our goal: construct a data-adaptive bound on $|\hat p_i - p^*_i|$ that does not depend on knowing properties of the true means.

A geometric approach. Isotonic regression = projection: $\hat p = \mathcal{P}_{\mathrm{iso}}(Y)$, $p^* = \mathcal{P}_{\mathrm{iso}}(P(X))$, where $\mathcal{P}_{\mathrm{iso}}$ is the projection onto the isotonic cone $\{t : t_1 \le \cdots \le t_n\}$. Convex projection $\Rightarrow$ $\|\hat p - p^*\|_2^2 \le \|Y - P(X)\|_2^2$, and $\|Y - P(X)\|_2^2 \sim n$, so at best the confidence interval width is $\sim 1$.

A geometric approach. Can we use a different norm? Theorem (contraction via isotonic projection): For any norm $\|\cdot\|$, $\|\mathcal{P}_{\mathrm{iso}}(u) - \mathcal{P}_{\mathrm{iso}}(v)\| \le \|u - v\|$ for all $u, v$ if and only if $\|\cdot\|$ is nonincreasing under neighbor averaging: $\big\|\big(u_1, \ldots, u_{i-1}, \tfrac{u_i + u_{i+1}}{2}, \tfrac{u_i + u_{i+1}}{2}, u_{i+2}, \ldots, u_n\big)\big\| \le \|u\|$.

A geometric approach. Sliding window norm: $\|u\|_{\mathrm{SW}} = \max_{1 \le i \le j \le n} \sqrt{j - i + 1}\,\big|\bar u_{i:j}\big|$, where $\bar u_{i:j}$ is the average of $u_i, \ldots, u_j$. If $Y_i = P(X_i) + \text{subgaussian noise}$, then $\sqrt{j - i + 1}\,\overline{(Y - P(X))}_{i:j}$ is subgaussian $\Rightarrow$ $\|Y - P(X)\|_{\mathrm{SW}} \lesssim \sqrt{\log(n)}$. Since $\|\cdot\|_{\mathrm{SW}}$ is contractive by our theorem, $\|\hat p - p^*\|_{\mathrm{SW}} \lesssim \sqrt{\log(n)}$.
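The sliding window norm is easy to compute from cumulative sums. The sketch below (plain numpy, $O(n^2)$, hypothetical function name) evaluates $\|u\|_{\mathrm{SW}}$ exactly as defined above.

```python
import numpy as np

def sliding_window_norm(u):
    """||u||_SW = max over windows i <= j of sqrt(j - i + 1) * |mean(u[i..j])|.

    For subgaussian noise this norm scales like sqrt(log n), much smaller
    than the Euclidean norm's sqrt(n)."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    csum = np.concatenate([[0.0], np.cumsum(u)])
    best = 0.0
    for i in range(n):
        lengths = np.arange(1, n - i + 1)                # window lengths j - i + 1
        means = (csum[i + 1:] - csum[i]) / lengths       # window averages over [i, j]
        best = max(best, np.max(np.sqrt(lengths) * np.abs(means)))
    return best
```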

Data-adaptive bands. Know: $\|\hat p - p^*\|_{\mathrm{SW}} \lesssim \sqrt{\log(n)}$, and $\hat p, p^*$ are both monotonic. Hence $\hat p_{i:j} - \sqrt{\tfrac{\log(n)}{j - i + 1}} \lesssim p^*_i \lesssim \hat p_{i:j} + \sqrt{\tfrac{\log(n)}{j - i + 1}}$ for any window containing $i$; by taking a minimum over all $j$, we find a data-adaptive bound. If $p^*_i$ is locally strictly increasing, the minimum is achieved at $|j - i| \sim n^{2/3}$, with an $n^{-1/3}$ rate. If $p^*_i$ is locally constant, the minimum is achieved at $|j - i| \sim n$, with an $n^{-1/2}$ rate.
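Combining the sliding-window bound with monotonicity gives the data-adaptive band: for each index, scan windows ending (resp. starting) there and keep the tightest lower (resp. upper) bound. The sketch below is an illustration with an assumed constant in front of $\log(n)$; it is not the paper's calibrated construction.

```python
import numpy as np

def adaptive_band(p_hat, sigma=1.0, c=2.0):
    """Data-adaptive band sketch around the isotonic estimate p_hat.

    Lower bound at i: max over windows [j, i] of (window mean) - sigma*sqrt(c*log(n)/length).
    Upper bound at i: min over windows [i, j] of (window mean) + sigma*sqrt(c*log(n)/length).
    `c` is an illustrative constant, not the paper's choice."""
    p_hat = np.asarray(p_hat, dtype=float)
    n = len(p_hat)
    csum = np.concatenate([[0.0], np.cumsum(p_hat)])
    lo, hi = np.empty(n), np.empty(n)
    for i in range(n):
        lens_l = np.arange(1, i + 2)                              # windows ending at i
        means_l = (csum[i + 1] - csum[i + 1 - lens_l]) / lens_l
        lo[i] = np.max(means_l - sigma * np.sqrt(c * np.log(n) / lens_l))
        lens_r = np.arange(1, n - i + 1)                          # windows starting at i
        means_r = (csum[i + lens_r] - csum[i]) / lens_r
        hi[i] = np.min(means_r + sigma * np.sqrt(c * np.log(n) / lens_r))
    return lo, hi
```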

Data-adaptive bands. [Figure: simulated data $y_i$, the isotonic estimate $\mathrm{iso}(y)_i$, and the confidence band, plotted against the index $i$.]

Data-adaptive bands. [Figure: convergence of the mean confidence band width, plotting log(mean band width) against $\log(n/\log(n))$. Left panel, convergence in flat regions: least-squares slope of magnitude 0.5178, consistent with the $n^{-1/2}$ rate. Right panel, convergence in the increasing region: least-squares slope of magnitude 0.3259, consistent with the $n^{-1/3}$ rate.]

Data-adaptive bands. Summary of the new features of our method: a data-adaptive band, so we don't need to know properties of $P(X_i)$. We do need a bound $\sigma$ on the noise level: for binary data, use $\sigma = 1$; otherwise it can be estimated with $\hat\sigma^2 = \frac{\|Y - \hat p\|_2^2}{n - (\text{effective d.f.})}$ (Meyer & Woodroofe 2000). The confidence band contains the entire function $\Rightarrow$ too conservative at any single point.
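For non-binary data, the noise level can be estimated from the isotonic fit itself. A rough sketch, under the assumption (for illustration, in the spirit of Meyer & Woodroofe 2000) that the effective degrees of freedom is the number of constant pieces of the fit:

```python
import numpy as np

def estimate_sigma2(y, p_hat):
    """sigma^2 estimate: ||Y - p_hat||^2 / (n - effective d.f.).

    Assumption for this sketch: effective d.f. = number of constant pieces
    (distinct fitted values) of the isotonic fit p_hat."""
    y, p_hat = np.asarray(y, float), np.asarray(p_hat, float)
    n = len(y)
    eff_df = len(np.unique(np.round(p_hat, 12)))   # number of constant pieces
    return np.sum((y - p_hat) ** 2) / max(n - eff_df, 1)
```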

Reusing data (classification). Goal: using a single data set, estimate a regression function $\widehat P(x) \approx P(x) = \mathbb{E}[Y \mid X = x]$, and build a confidence band containing the true $P(x)$ (at most points $x$). This is a selective inference problem: choosing the model, and assessing its accuracy, using a single data set.

Reusing data (classification). How can we use a single data set to both construct $\widehat P(x)$ and build the confidence band? (1) Randomize the data by flipping some of the $Y_i$'s: $Y'_i = Y_i$ with probability $1 - \varphi$, and $Y'_i = 1 - Y_i$ with probability $\varphi$, where $\varphi$ is the flip probability.

Reusing data (classification). (2) Fit the regression function $\widehat P(x)$ using the blurred data $(X_1, Y'_1), \ldots, (X_n, Y'_n)$. Distribution of $Y' \mid X$: $\mathbb{P}\{Y'_i = 1 \mid X_i\} = P(X_i)(1 - \varphi) + (1 - P(X_i))\,\varphi$.

Reusing data (classification). (3) Then perform inference for $P(X)$ by revealing the original data $Y$. Distribution of $Y \mid X, Y'$: $\mathbb{P}\{Y_i = 1 \mid X_i, Y'_i = 1\} = \frac{P(X_i)(1 - \varphi)}{P(X_i)(1 - \varphi) + (1 - P(X_i))\,\varphi}$ and $\mathbb{P}\{Y_i = 1 \mid X_i, Y'_i = 0\} = \frac{P(X_i)\,\varphi}{P(X_i)\,\varphi + (1 - P(X_i))(1 - \varphi)}$.
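The flip-and-reveal scheme is easy to simulate. The sketch below (hypothetical helper names, numpy only) blurs the labels with flip probability $\varphi$ and evaluates the conditional distribution of the original labels given the blurred ones, matching the two formulas above.

```python
import numpy as np

def flip_labels(y, phi, rng=None):
    """Blur binary labels: each Y_i is flipped independently with probability phi."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y)
    flips = rng.random(len(y)) < phi
    return np.where(flips, 1 - y, y)

def reveal_posterior(p, phi, y_blurred):
    """P{Y_i = 1 | X_i, Y'_i} via Bayes' rule, given P(X_i) = p and flip probability phi."""
    p = np.asarray(p, dtype=float)
    post_if_1 = p * (1 - phi) / (p * (1 - phi) + (1 - p) * phi)   # given Y'_i = 1
    post_if_0 = p * phi / (p * phi + (1 - p) * (1 - phi))          # given Y'_i = 0
    return np.where(np.asarray(y_blurred) == 1, post_if_1, post_if_0)
```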

Summary. Inference for estimation: isotonic regression + fresh data $\Rightarrow$ distribution-free confidence bands. Is it possible to fit the model and perform inference on a single data set? (Distribution-free selective inference?)

Thank you! Website: http://www.stat.uchicago.edu/~rina/ Preprint: http://arxiv.org/abs/1706.01852 Thanks to funding from the NSF & a Sloan Fellowship.