Distribution-Free Distribution Regression

1 Distribution-Free Distribution Regression Barnabás Póczos, Alessandro Rinaldo, Aarti Singh and Larry Wasserman AISTATS 2013 Presented by Esther Salazar Duke University February 28, 2014 E. Salazar (Reading group) February 28, / 11

2 Outline
1. Regression Models
2. Distribution Regression
3. Regression problem
   - The kernel-kernel estimator
   - Upper bound on risk
4. Numerical illustration

3 Regression Models
Standard regression model: predict a real-valued response $Y_i \in \mathbb{R}$ from a covariate vector $X_i \in \mathbb{R}^d$:
$Y_i = \beta X_i + \mu_i$
An extension is functional regression, where the covariate is a function instead of a finite-dimensional vector:
$Y_i = f(X_i) + \mu_i$
Distribution regression: the covariate is a probability distribution $P_i$, i.e.
$Y_i = f(P_i) + \mu_i$
where $f$ is an unknown function and $\mu_i$ is a random error.

4 Distribution Regression
Model: $Y_i = f(P_i) + \mu_i$
This differs from functional regression in two ways:
1. $P_i$ is a probability measure on $\mathbb{R}^k$ rather than a one-dimensional function.
2. Important: we do not observe the covariate $P_i$ directly; we observe only a sample from $P_i$.

5 Distribution Regression
Model: $Y_i = f(P_i) + \mu_i$
Example: patient classification
- Suppose we have $m$ patients; the goal is to predict the class label $Y_i \in \{\text{healthy}, \text{diseased}\}$, $i = 1, \ldots, m$.
- Traditional machine-learning approach: given feature vectors $X_i \in \mathbb{R}^d$, apply a standard classifier to predict the class labels.
- Problem: features based on heart rate, blood pressure, and other medical measurements are always changing.
- Let the set of these measurements for patient $i$ be $\{X_{i,1}, \ldots, X_{i,n_i}\}$, where $X_{i,j} \in \mathbb{R}^d$.
- We could construct a new feature vector $X_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{i,j}$, but this loses information.
- In the distribution regression model, each person is represented by an unknown distribution $P_i$, with $X_{i,j} \sim P_i$.
- Goal: classify these unknown distributions $P_i$.
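The information loss from averaging can be illustrated with a small sketch. The two "patients" below are hypothetical: their measurements are simulated as Gaussian draws with the same mean and different spread, which is an assumption made purely for illustration and not data from the paper.

```python
import random
import statistics

def summarize(sample):
    """Return (mean, standard deviation) of a one-dimensional sample."""
    return statistics.fmean(sample), statistics.pstdev(sample)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two hypothetical patients: same average measurement, different variability.
patient_a = [rng.gauss(70.0, 1.0) for _ in range(2000)]
patient_b = [rng.gauss(70.0, 15.0) for _ in range(2000)]

mean_a, std_a = summarize(patient_a)
mean_b, std_b = summarize(patient_b)

# The sample means are nearly identical, so the averaged feature cannot
# tell the patients apart, while the spread of the samples clearly can.
print(f"means: {mean_a:.2f} vs {mean_b:.2f}")
print(f"stds:  {std_a:.2f} vs {std_b:.2f}")
```

Any summary that keeps only the mean maps both patients to almost the same point, whereas the underlying distributions differ markedly.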

6 Regression problem
Regression problem with variables $(P_1, Y_1), \ldots, (P_m, Y_m)$, where $Y_i \in \mathbb{R}$ and each $P_i$ is a probability distribution:
$Y_i = f(P_i) + \mu_i, \quad i = 1, \ldots, m$
- $f$ is a function (defined later).
- $\mu_i$ is a noise variable with mean 0.
- We do not observe $P_i$; we observe a sample $X_{i,1}, \ldots, X_{i,n_i} \sim P_i$.
- Observed data: $(X_1, Y_1), \ldots, (X_m, Y_m)$, where $X_i = \{X_{i,1}, \ldots, X_{i,n_i}\}$.
- Goal: predict a new $Y_{m+1}$ from a new batch $X_{m+1}$ drawn from a new distribution $P_{m+1}$.

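7 The kernel-kernel estimator
The authors assume that the distributions $P_i$ are an i.i.d. sample from a measure $\mathcal{P}$.
Estimator $\hat f$ for the unknown function $f$:
$\hat f(\hat P) = \frac{\sum_{i=1}^{m} Y_i \, K\left(D(\hat P_i, \hat P)/h\right)}{\sum_{i=1}^{m} K\left(D(\hat P_i, \hat P)/h\right)}$ if the denominator is positive, and 0 otherwise,
where $K$ is a kernel, $h > 0$ is a bandwidth and $D$ is a distance between distributions.
Here $\hat P_i$ denotes an estimator of $P_i$, and $\hat P$ is the corresponding estimator built from a sample $X$ from a new distribution $P = P_{m+1}$.
Predictor for $Y_{m+1}$: $\hat Y_{m+1} = \hat f(\hat P_{m+1})$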

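8 To complete the definition we need to specify $\hat P_i$, $\hat P$ and $D$.
Estimate $P_i$ (more precisely, the density $p_i$ of $P_i$) with a kernel density estimator
$\hat p_i(x) = \frac{1}{n_i b_i^k} \sum_{j=1}^{n_i} B\left(\frac{\|x - X_{i,j}\|}{b_i}\right)$,
where $B$ is an appropriate kernel function (Tsybakov, 2010) with bandwidth $b_i > 0$, and $\hat P_i$ is defined by
$\hat P_i(A) = \int_A \hat p_i(u)\,du$.
For $D$: for any two probability distributions $P$ and $Q$ with densities $p$ and $q$, they take $D(P, Q)$ to be the $L_1$ distance of their densities,
$D(P, Q) = \|p - q\|_1 = \int |p(x) - q(x)|\,dx$.
Therefore, the kernel-kernel estimator is
$\hat f(\hat P) = \frac{\sum_{i=1}^{m} Y_i \, K\left(\|\hat p_i - \hat p\|_1 / h\right)}{\sum_{i=1}^{m} K\left(\|\hat p_i - \hat p\|_1 / h\right)}$.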
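A minimal one-dimensional sketch of this construction follows. It makes several assumptions that are not from the slides: the triangular kernel is used for both the density kernel $B$ and the outer kernel $K$, densities are evaluated on a fixed grid over $[0, 1]$, the $L_1$ distance is approximated by a Riemann sum, and the toy response is simply the Beta shape parameter $a$. The function names, the bandwidths `b` and `h`, and the toy data are all hypothetical.

```python
import random

def triangular(x):
    """Triangular kernel: 1 - |x| on [-1, 1], 0 elsewhere."""
    return max(0.0, 1.0 - abs(x))

def kde_on_grid(sample, grid, b):
    """Kernel density estimate of `sample`, evaluated at each grid point."""
    n = len(sample)
    return [sum(triangular((u - x) / b) for x in sample) / (n * b) for u in grid]

def l1_distance(p, q, step):
    """Riemann-sum approximation of the L1 distance between two densities."""
    return step * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kernel_kernel_predict(train_sets, train_y, new_set, grid, b, h):
    """Average the responses, weighting each by K(D(p_i, p_new) / h)."""
    step = grid[1] - grid[0]
    p_new = kde_on_grid(new_set, grid, b)
    weights = [
        triangular(l1_distance(kde_on_grid(s, grid, b), p_new, step) / h)
        for s in train_sets
    ]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # convention when no training distribution is close enough
    return sum(w * y for w, y in zip(weights, train_y)) / total

# Toy problem (an assumption for illustration): sets drawn from Beta(a, 3),
# with the response equal to the shape parameter a.
rng = random.Random(0)
grid = [i / 50 for i in range(51)]
shapes = [1.0 + 0.5 * i for i in range(11)]  # a = 1.0, 1.5, ..., 6.0
train_sets = [[rng.betavariate(a, 3.0) for _ in range(300)] for a in shapes]
train_y = shapes
new_set = [rng.betavariate(3.0, 3.0) for _ in range(300)]

pred = kernel_kernel_predict(train_sets, train_y, new_set, grid, b=0.15, h=0.5)
print(f"prediction for a new set with a = 3: {pred:.2f}")
```

The prediction is a weighted average of the training responses, so it always lies between the smallest and largest $Y_i$ whenever at least one weight is positive. In practice both bandwidths would be tuned on held-out sets rather than fixed by hand as they are here.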

10 Upper bound on risk
Concern: upper bounding the risk.
Note that the absolute prediction risk satisfies
$E|\hat Y - Y| \le R(m, n) + c$,
where $c = E(|\mu|)$ is a constant.
Main result: an upper bound on $R(m, n)$.
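The decomposition into $R(m, n)$ and a constant can be seen from the triangle inequality. The sketch below assumes that $R(m, n)$ denotes the estimation risk $E|\hat f(\hat P) - f(P)|$ for the new distribution $P = P_{m+1}$; that definition is an assumption here, inferred from the notation. It uses $\hat Y = \hat f(\hat P)$ and $Y = f(P) + \mu$.

```latex
\begin{aligned}
E\,|\hat{Y} - Y|
  &= E\,\bigl|\hat f(\hat P) - f(P) - \mu\bigr| \\
  &\le E\,\bigl|\hat f(\hat P) - f(P)\bigr| + E\,|\mu| \\
  &= R(m, n) + c .
\end{aligned}
```

Since $c$ does not depend on the estimator, controlling the prediction risk reduces to bounding $R(m, n)$.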

11 Numerical illustration
- They assume $n = n_1 = \cdots = n_m = 500$ (set sizes).
- $m = 250$ sample sets for training, 25 for validation and 50 for testing.
- Each sample set contained $n = 500$ i.i.d. points drawn from a Beta$(a, 3)$ distribution.
- Task: learn the skewness of Beta$(a, b)$ distributions, $f = \frac{2(b - a)\sqrt{a + b + 1}}{(a + b + 2)\sqrt{ab}}$.
- The estimator is not aware that the sample sets come from beta distributions, nor does it know the skewness function.
- Kernel used: $K(x) = 1 - |x|$ if $-1 \le x \le 1$, and 0 otherwise.
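The target function of this experiment can be checked numerically. The sketch below evaluates the closed-form skewness and compares it with a moment-based sample skewness. The sample size of 20000 and the choice $a = 1$ are assumptions made so that the comparison is stable; they are not the experiment's settings.

```python
import math
import random

def beta_skewness(a, b):
    """Closed-form skewness of a Beta(a, b) distribution."""
    return 2.0 * (b - a) * math.sqrt(a + b + 1.0) / ((a + b + 2.0) * math.sqrt(a * b))

def sample_skewness(xs):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [rng.betavariate(1.0, 3.0) for _ in range(20000)]

theoretical = beta_skewness(1.0, 3.0)
empirical = sample_skewness(draws)
print(f"theoretical: {theoretical:.3f}, empirical: {empirical:.3f}")
```

A symmetric case such as Beta(3, 3) gives a skewness of exactly zero, which matches the $(b - a)$ factor in the formula.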
