Unsupervised Regressive Learning in High-dimensional Space


1 Unsupervised Regressive Learning in High-dimensional Space. University of Kent. ATRC, Leicester, 31st July 2018.

2 Outline
- Data linkage analysis
- High dimensionality and variable screening
- Variable screening and mixture models
- EPD mixture regression models
- EPD mixture-based variable selection
- Simulation studies
- Conclusion

3 Data linkage analysis
Data: telematic devices (time-dependent measurements), credit reports (discrete), satellite data (time series), genetic data, and historical records from policy administration systems. Insurers are showing increasing interest in using data linkage to improve their pricing accuracy and to facilitate more effective loss prevention; see the Policy Briefing from the IFoA (2017). However, the utility of data linkage can be compromised by the high dimensionality, heterogeneity, and heavy distribution tails of these data.

4-6 High dimensionality and variable screening
Consider
$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i, \quad 1 \le i \le n,$
where the $\varepsilon_i$ are i.i.d. $N(0, 1)$ and there are many more variables than the sample size ($p \gg n$).

7-8 High dimensionality and variable screening
LASSO: coefficients estimated by $L_1$-penalised least squares.
Correlation screening: to screen variables, for each $j$ we single out the $j$-th covariate and rewrite the above equation as
$y_i = x_{ij}\beta_j + \tilde{\varepsilon}_i, \quad \text{with } \tilde{\varepsilon}_i = \sum_{t \ne j} x_{it}\beta_t + \varepsilon_i, \quad 1 \le i \le n.$
This gives rise to what is called correlation variable screening.
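
A minimal sketch of the correlation screening step in Python (the function name is mine, not from the slides): rank covariates by absolute marginal correlation with the response and keep the top $d$.

```python
import numpy as np

def correlation_screen(X, y, d):
    """Rank covariates by |marginal correlation| with y; keep the top d."""
    Xc = X - X.mean(axis=0)                     # centre each covariate
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(-np.abs(corr))[:d]        # indices of retained covariates
```

A common choice in the screening literature is to retain on the order of $n/\log n$ variables.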

9-12 Variable screening and mixture models
In general, LASSO is not efficient when the $y_i$ are heterogeneous. Correlation variable screening is also not efficient
- if the covariate observations have a group structure in which the $\varepsilon_i$ are heterogeneously distributed, or
- if $\{x_{it} : 1 \le t \le p\}$ are heavy-tailed.
To address these issues, we first consider a family of distributions for the $y_i$: the exponential power distribution (EPD).

13 EPD
$\phi(y \mid \mu, \sigma, \alpha) = \frac{\alpha}{2\sigma\Gamma(1/\alpha)} \exp\!\left(-\frac{|y - \mu|^{\alpha}}{\sigma^{\alpha}}\right),$
where $\mu \in (-\infty, \infty)$, $\alpha > 0$ and $\sigma > 0$. The EPD reduces to the normal distribution when $\alpha = 2$ and to the Laplace distribution when $\alpha = 1$.
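
The density on this slide translates directly into code; a short sketch using SciPy's gamma function:

```python
import numpy as np
from scipy.special import gamma

def epd_pdf(y, mu, sigma, alpha):
    """EPD density: alpha = 2 recovers a normal density, alpha = 1 a Laplace."""
    z = np.abs(y - mu) / sigma
    return alpha / (2.0 * sigma * gamma(1.0 / alpha)) * np.exp(-z ** alpha)
```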

14-17 EPD mixture regression models
Let $(y_i, x_i)$, $i = 1, \ldots, n$, be independent observations on a response $y$ and a $p$-dimensional covariate $x$. We then consider
$f(y_i \mid x_i, \Theta_K) = \sum_{k=1}^{K} \pi_k\, \phi(y_i \mid x_i^T \beta_k, \sigma_k^2, \alpha_k),$
where $\Theta_K$ denotes the set of all the parameters, $\phi(y_i \mid x_i^T \beta_k, \sigma_k^2, \alpha_k)$ is the $k$-th component density with regression coefficients $\beta_k = (\beta_{k1}, \ldots, \beta_{kp})^T \in \mathbb{R}^p$, scale $\sigma_k^2 \in (0, \infty)$, shape $\alpha_k \in (0, \infty)$, mixing proportion $\pi_k \ge 0$, and $\sum_{k=1}^{K} \pi_k = 1$.
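
A sketch of the mixture density, reusing epd_pdf from the sketch above. The slides parametrise each component by $\sigma_k^2$, so passing $\sqrt{\sigma_k^2}$ as the scale is an assumed convention here:

```python
import numpy as np

def epd_mixture_pdf(y, X, pi, beta, sigma2, alpha):
    """f(y_i | x_i, Theta_K); shapes: X (n, p), beta (K, p), pi/sigma2/alpha (K,)."""
    dens = np.zeros(len(y))
    for k in range(len(pi)):
        dens += pi[k] * epd_pdf(y, X @ beta[k], np.sqrt(sigma2[k]), alpha[k])
    return dens
```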

18-19 EPD mixture-based variable selection
The new proposal: for each observation, we first construct the penalized likelihood and then combine these likelihoods by a component-wise weighting:
$\mathrm{pl}_n(\Theta_K \mid (y_i, x_i)) = \sum_{k=1}^{K} \pi_k\, \phi(y_i \mid x_i^T \beta_k, \sigma_k^2, \alpha_k)\, \exp\!\left\{-\left(\lambda \|\beta_k\|_1 + \kappa_0\, \frac{|1 - \sigma_k|}{\sigma_k^{2/n}}\right)\right\}.$
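
A sketch of the per-observation penalized likelihood. The exact penalty is reconstructed from a garbled formula: the $|1-\sigma_k|/\sigma_k^{2/n}$ term, which shrinks each scale towards 1, is my best reading, so treat the form as an assumption:

```python
import numpy as np

def pl_obs(y_i, x_i, pi, beta, sigma, alpha, lam, kappa0, n):
    """pl_n(Theta_K | (y_i, x_i)): component densities damped by an L1
    penalty on beta_k and a penalty shrinking sigma_k towards 1."""
    val = 0.0
    for k in range(len(pi)):
        dens = epd_pdf(y_i, x_i @ beta[k], sigma[k], alpha[k])
        pen = np.exp(-(lam * np.abs(beta[k]).sum()
                       + kappa0 * abs(1.0 - sigma[k]) / sigma[k] ** (2.0 / n)))
        val += pi[k] * dens * pen
    return val
```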

20-21 EPD mixture-based variable selection
The proposed penalized likelihood:
$\mathrm{pl}_n(\Theta_K \mid (Y, X)) = \prod_{i=1}^{n} \mathrm{pl}_n(\Theta_K \mid (y_i, x_i)) \prod_{k=1}^{K} \pi_k^{\delta_k},$
where $\delta_k$, $k = 1, \ldots, K$, are pre-specified constants with default $\delta_k = 1/K$. The number of components $K$ is chosen by minimising a BIC with respect to $1 \le K \le K_n$ and $\lambda_0 \le \lambda \le \lambda_1$. The advantage of the new proposal over the existing one lies in its computation and in the convergence of the generalized EM (GEM) algorithm.
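
The BIC-based choice of $(K, \lambda)$ amounts to a grid search; a sketch, where fit_gem is a hypothetical GEM fitter (not from the slides) returning the maximised log-likelihood and the number of free parameters:

```python
import numpy as np

def select_by_bic(y, X, K_max, lambdas, fit_gem):
    """Minimise BIC over 1 <= K <= K_max and lambda in `lambdas`."""
    best = (np.inf, None, None)
    for K in range(1, K_max + 1):
        for lam in lambdas:
            loglik, n_par = fit_gem(y, X, K, lam)
            bic = -2.0 * loglik + n_par * np.log(len(y))
            if bic < best[0]:
                best = (bic, K, lam)
    return best   # (BIC value, chosen K, chosen lambda)
```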

22-26 Simulation studies
We considered the following four screening procedures in the simulation studies ($\lambda = 0$):
- Correlation learning, i.e. simple Gaussian linear regression (GAU1)
- Simple EPD linear regression (EPD1)
- Simple Gaussian mixture regression (GAUMIX, BIC-based)
- Simple EPD mixture regression (EPDMIX, BIC-based)
We compare how well GAU1, EPD1, GAUMIX, and EPDMIX screen out non-active covariates in the model, in terms of specificity and sensitivity.
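
The two performance measures reduce to simple set arithmetic; a sketch (function and argument names are mine):

```python
def screening_metrics(selected, active, p):
    """Sensitivity: proportion of active covariates retained.
    Specificity: proportion of the p - |active| non-active covariates screened out."""
    selected, active = set(selected), set(active)
    sens = len(selected & active) / len(active)
    spec = (p - len(active) - len(selected - active)) / (p - len(active))
    return sens, spec
```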

27-28 Simulation studies
Setting 4.1.1 (multiple linear regression): we generated 40 datasets with sample size $n$ and dimension $p$. Each dataset contained observations $(y_i, x_{ij})$, $1 \le j \le p$, $1 \le i \le n$, satisfying
$y_i = \sum_{j=1}^{p} x_{ij}\beta_{0j} + \varepsilon_i,$
where the $\varepsilon_i$, $1 \le i \le n$, were i.i.d. $N(0, 1)$ and the regression coefficients were $\beta_0 = (2 + \eta_1, \eta_2, \eta_3, \eta_4, \eta_5, \mathbf{0}_{p-5}^T)^T$, with the $\eta_j$, $1 \le j \le 5$, i.i.d. $N(0, \cdot)$ and $\mathbf{0}_{p-5}$ a $(p-5)$-vector of zeros. There were five active covariates in the model.
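
A sketch of the data generator for Setting 4.1.1. The $N(0, \cdot)$ variance of the $\eta_j$ was lost in transcription, so eta_sd is a placeholder, and the standard-normal covariate design is an assumption:

```python
import numpy as np

def simulate_setting_411(n, p, eta_sd=1.0, rng=None):
    """One dataset: y_i = x_i' beta_0 + eps_i with 5 active covariates."""
    rng = np.random.default_rng() if rng is None else rng
    eta = rng.normal(0.0, eta_sd, size=5)
    beta0 = np.concatenate(([2.0 + eta[0]], eta[1:], np.zeros(p - 5)))
    X = rng.normal(size=(n, p))                 # assumed covariate design
    y = X @ beta0 + rng.normal(size=n)          # eps_i i.i.d. N(0, 1)
    return X, y, beta0
```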

29-30 Simulations
Setting 4.1.2 (Gaussian mixture regression): we generated 40 datasets with sample size $n$, dimension $p$ and $K_0$ components. Given $x_i = (x_{i1}, \ldots, x_{ip})^T$, the $y_i$, $1 \le i \le n$, were independently sampled from
$f(y_i) = \sum_{k=1}^{K_0} \pi_k\, \phi(y_i - x_i^T \beta_k),$
where $\phi(\cdot)$ is the density of the standard normal distribution.

31-33 Simulations
We considered two cases of $K_0$:
(1) $K_0 = 2$, with two components
$\beta_1 = (2 + v_{11}, v_{12}, v_{13}, v_{14}, v_{15}, \mathbf{0}_{p-5}^T)^T$,
$\beta_2 = (0, 0, 0, 4 + v_{21}, 4 + v_{22}, 4 + v_{23}, 4 + v_{24}, 4 + v_{25}, \mathbf{0}_{p-8}^T)^T$;
(2) $K_0 = 3$, with three components
$\beta_1 = (2 + v_{11}, v_{12}, v_{13}, v_{14}, v_{15}, \mathbf{0}_{p-5}^T)^T$,
$\beta_2 = (0, 0, 0, 4 + v_{21}, 4 + v_{22}, 4 + v_{23}, 4 + v_{24}, 4 + v_{25}, \mathbf{0}_{p-8}^T)^T$,
$\beta_3 = (0, 0, 0, 0, 0, 0, 4 + v_{31}, 4 + v_{32}, \mathbf{0}_{p-8}^T)^T$.
Here the $v_{kj}$ are i.i.d. $N(0, \cdot)$ and $\mathbf{0}_m$ denotes an $m$-vector of zeros. For each case of $K_0$, we considered $(n, p) = (300, 400)$ and $(500, 600)$.
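
A sketch of the $K_0 = 2$ generator. The mixing proportions, the $v_{kj}$ variance, and the covariate distribution are not given on the slides, so pi, v_sd, and the standard-normal design are placeholders:

```python
import numpy as np

def simulate_setting_412(n, p, pi=(0.5, 0.5), v_sd=1.0, rng=None):
    """One dataset from the two-component Gaussian mixture regression."""
    rng = np.random.default_rng() if rng is None else rng
    v1 = rng.normal(0.0, v_sd, size=5)
    v2 = rng.normal(0.0, v_sd, size=5)
    beta1 = np.concatenate(([2.0 + v1[0]], v1[1:], np.zeros(p - 5)))
    beta2 = np.concatenate((np.zeros(3), 4.0 + v2, np.zeros(p - 8)))
    X = rng.normal(size=(n, p))
    z = rng.choice(2, size=n, p=list(pi))       # latent component labels
    y = np.where(z == 0, X @ beta1, X @ beta2) + rng.normal(size=n)
    return X, y, z
```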

34 Table: percentage increase of average specificity relative to GAU1 in variable screening, Setting 4.1.1 (single component).
Rows: GAU1, EPD1, EPDMIX, GAUMIX; columns: sensitivity levels 5/5, 4/5, 3/5, 2/5, 1/5; panels: $(n, p) = (500, 600)$ and $(n, p) = (100, 2000)$. (The numerical entries were lost in transcription.)

35 Table: percentage increase of average specificity relative to GAU1 in variable screening, Setting 4.1.2 (multiple components).
Rows: GAU1, EPD1, EPDMIX, GAUMIX; columns: sensitivity levels 8/8, 7/8, ..., 1/8; panels: two components with $(n, p) = (300, 400)$ and $(n, p) = (500, 600)$. (The numerical entries were lost in transcription.)

36 Table: percentage increase of average specificity relative to GAU1 in variable screening, Setting 4.1.2 (multiple components).
Rows: GAU1, EPD1, EPDMIX, GAUMIX; columns: sensitivity levels 8/8, 7/8, ..., 1/8; panels: three components with $(n, p) = (300, 400)$ and $(n, p) = (500, 600)$. (The numerical entries were lost in transcription.)

37-39 Conclusion
We have proposed a new approach to unsupervised regressive learning. Simulation studies show that the proposal outperforms the existing procedures, and a comparison with the LASSO also favoured our approach.

40 Thank you!
