Estimating Posterior Ratio for Classification: Transfer Learning from Probabilistic Perspective

Estimating Posterior Ratio for Classification: Transfer Learning from Probabilistic Perspective

Song Liu, Kenji Fukumizu
Institute of Statistical Mathematics, Tokyo, Japan

arXiv v3 [stat.ML] 9 Oct 2015

Abstract

Transfer learning assumes classifiers of similar tasks share certain parameter structures. Unfortunately, modern classifiers use sophisticated feature representations with huge parameter spaces, which leads to costly transfer. Under the impression that changes from one classifier to another should be simple, an efficient transfer learning criterion that only learns the differences is proposed in this paper. We train a posterior ratio, which turns out to minimize an upper-bound of the target learning risk. The model of the posterior ratio does not have to share the same parameter space with the source classifier at all, so it can be easily modelled and efficiently trained. The resulting classifier is therefore obtained by simply multiplying the existing probabilistic classifier with the learned posterior ratio.

Keywords: Transfer Learning, Domain Adaptation.

1 Introduction

Transfer learning [12, 13, 6] trains a classifier using a limited number of samples with the help of abundant samples drawn from another, similar distribution. Specifically, we have a target task providing a very small dataset $D_P$, as well as a slightly different source task with a large dataset $D_Q$. Transfer learning usually refers to procedures that make use of the similarity between the two learning tasks to build a superior classifier using both datasets. In this paper, we focus on probabilistic classification problems where the goal is to learn a class posterior $p(y|x)$ over $D_P$, where $p(y|x)$ is the conditional probability of class labels given an input $x$.

Due to the complexity of its parametrization, the predicting function is usually encoded in hardware and executed with great efficiency. It is therefore reasonable to look at a composite algorithm that consists of two parts: a fixed but fast built-in classifier offering a complicated predicting pattern, and a light-weight procedure that works as an adapter, transferring the classifier to a variety of slightly different situations. For example, a general-purpose facial

recognition system built into a camera cannot change its predicting behavior once its model is trained; however, the camera may learn transfer models and adjust itself for recognizing a target user. The challenge is that the transfer procedure is expected to respond rapidly, while learning over the entire feature set of the source classifier may slow us down dramatically.

Intuitively, learning a transfer model does not necessarily need complicated features. Since the task is still facial recognition, we can assume that the changes from one classifier to another are simple and can be described by a trivial (say linear) model with a few key personal features (say hair-style or glasses). The general human facial modelling also plays an important role; however, we may safely assume that such modelling has been taken care of in the source classifier and remains unchanged in the target task. Thus, we can consider the incremental model only in the transfer procedure.

One popular assumption in transfer learning is to reuse the model from the source classifier by training a target classifier while limiting the distance between it and the source classifier model. Regularization has been utilized to enforce the closeness between learned models [6]. More complicated structures, such as dependencies between task parameters, are also used to construct a good classifier [13]. As most methods require learning two classifiers of two tasks simultaneously, some works can take already trained classifiers as auxiliary models and learn to reuse their model structures [18, 2, 5]. However, reusing the existing model means we need to bring the entire feature set from the source task and include it in the target classifier during transfer learning, even if we know that a vast majority of those features does not contribute to the transition from the source to the target classifier. Such an overly expressive model can be harmful given the limited samples in $D_P$. Moreover, the hyper-parameters used for constructing features may also be difficult to tune, since cross-validation may be poor on such a small dataset $D_P$. Finally, obtaining those features may be time-consuming in some applications.

Another natural idea of transfer learning is to borrow informative samples from $D_Q$ and get rid of harmful samples. TrAdaBoost [4] follows this exact learning strategy to assign weights to samples from both $D_P$ and $D_Q$. By assigning high weights to samples that contribute to the performance on the target task, and penalizing samples that mislead the classifier, TrAdaBoost reuses the knowledge from both datasets to construct an accurate classifier on the target task. The idea of importance sampling also gives rise to another set of methods learning weights of samples by using density ratio estimation [14, 9, 19]. Using unlabelled samples from both datasets, an importance weighting function can be learned. By plugging such a function into the empirical risk minimization criterion [16], we can use samples from $D_Q$ as if they were samples from $D_P$. However, such methods do not allow incremental modelling either, since they learn a full classifier model during the transfer.

It can be noticed that if one can directly model and learn the difference between the target and source classifiers, one may use only the incremental features, which leads to a much more efficient learning criterion. The first contribution of this paper is showing that such difference learning is in fact the learning of a posterior ratio, the ratio between the posteriors of the source and target tasks.
We show that learning such a posterior ratio is equivalent to minimizing an upper-bound of

the classification error of the target task. Second, an efficient convex optimization algorithm is given to learn the parameters of the posterior ratio model, and it is proved to give consistent estimates under mild assumptions. Finally, the usefulness of this method is validated over various artificial and real-world datasets. However, we do not claim that the proposed method has superior performance against all existing works based on extra assumptions, e.g. the smoothness of the predicting function over unlabeled target samples [5, 2]. The proposed method is simply a novel probabilistic framework working on a very small set of assumptions, and it offers the flexibility of modelling to transfer learning problems. It is fully extendable to various problem settings once new assumptions are made.

2 Problem Setting

Consider two sets of samples drawn independently from two probability distributions $Q$ and $P$ on $\{-1, 1\} \times \mathbb{R}^d$:

$$D_Q = \{(y^q_j, x^q_j)\}_{j=1}^{n_q} \overset{\mathrm{i.i.d.}}{\sim} Q, \qquad D_P = \{(y^p_i, x^p_i)\}_{i=1}^{n_p} \overset{\mathrm{i.i.d.}}{\sim} P.$$

$D_Q$ and $D_P$ are the source and target datasets respectively. We denote $p(y|x)$ and $q(y|x)$ as the class posteriors in $P$ and $Q$ respectively. Moreover, $n_p \ll n_q$. Our target is to obtain an estimate $\hat{p}(y|x)$ of the class posterior and predict the class label of an input $x$ by $\hat{y} = \mathrm{argmax}_{y \in \{-1,1\}} \hat{p}(y|x)$. Clearly, if $n_p$ is large enough, one may apply logistic regression [3, 20] to obtain a good estimate. In this paper, we focus on a scenario where $n_p$ is relatively small and $n_q$ is sufficiently large. Thus, it is desirable if we can transfer information from the source task to boost the performance of our target classifier.

3 Composite Modeling

Note that the posterior $p(y|x)$ can be decomposed into

$$p(y|x) = \frac{p(y|x)}{q(y|x)} \cdot q(y|x),$$

where $\frac{p(y|x)}{q(y|x)}$ is the class posterior ratio and $q(y|x)$ is a source classifier. This decomposition leads to a simple transfer learning methodology: model and learn the posterior ratio and the general-purpose classifier separately, then multiply them together as an estimate of the posterior. The main interest of this paper is learning such a composite model using samples from $D_P$ and $D_Q$. Now, we introduce two parametric models, $g(y, x; \alpha)$ (or $g_\alpha$ for short) and $q(y, x; \beta)$ (or $q_\beta$ for short), for $\frac{p(y|x)}{q(y|x)}$ and $q(y|x)$ respectively.
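To make the composite construction concrete, the following minimal sketch (Python/NumPy; the function and variable names are illustrative assumptions, not from the paper) shows how a learned ratio model with linear features would re-weight a given source classifier's output. Dividing by the sum of the two unnormalized scores plays the role of the normalization term introduced in Section 3.2 below:

```python
import numpy as np

def composite_posterior(q_pos, alpha, X):
    """P(y=+1|x) under the composite model g(y, x; alpha) * q(y|x).

    q_pos : source classifier outputs q(y=+1|x), shape (n,)
    alpha : ratio-model parameters for linear features f(y, x) = y * [x, 1]
    X     : inputs, shape (n, d)
    """
    s = np.hstack([X, np.ones((len(X), 1))]) @ alpha  # <alpha, [x, 1]>
    u_pos = q_pos * np.exp(s)           # q(+1|x) * exp(<alpha, f(+1, x)>)
    u_neg = (1.0 - q_pos) * np.exp(-s)  # q(-1|x) * exp(<alpha, f(-1, x)>)
    # Dividing by u_pos + u_neg applies the normalizer N(x; alpha),
    # so the composite output is a valid conditional probability.
    return u_pos / (u_pos + u_neg)
```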

3.1 Kullback-Leibler Divergence Minimization

A natural way of learning such a model is to minimize the Kullback-Leibler (KL) [10] divergence between the true posterior and our composite model.

Definition 1 (Conditional KL Divergence).

$$\mathrm{KL}[p \,\|\, q] = P \left[ \log \frac{p(y|x)}{q(y|x)} \right].$$

We denote $P f$ as shorthand for the integral/sum of a function $f$ over the probability distribution $P$ on its domain. Now, we proceed to obtain the following upper-bound of the KL divergence from $p$ to the composite model.

Proposition 1 (Transfer Learning Upper-bound). If $\frac{p(y,x)}{q(y,x)} \le C_{\max} < \infty$ and $0 < q_\beta < 1$, then the following inequality holds:

$$\mathrm{KL}[p \,\|\, g_\alpha q_\beta] \le \mathrm{KL}[p \,\|\, g_\alpha q] + C_{\max}\, \mathrm{KL}[q \,\|\, q_\beta] + C, \qquad (1)$$

where $C$ is a constant that is irrelevant to $\alpha$ or $\beta$.

Proof.

$$\begin{aligned}
\mathrm{KL}[p \,\|\, g_\alpha q_\beta]
&= \mathrm{KL}\Big[p \,\Big\|\, g_\alpha q \cdot \frac{q_\beta}{q}\Big]
= \mathrm{KL}[p \,\|\, g_\alpha q] - P \log q_\beta + P \log q \\
&= \mathrm{KL}[p \,\|\, g_\alpha q] - \int \frac{p(y,x)}{q(y,x)}\, q(y,x) \log q_\beta \,\mathrm{d}y\,\mathrm{d}x + P \log q \\
&\le \mathrm{KL}[p \,\|\, g_\alpha q] + C_{\max}\, Q \log q - C_{\max}\, Q \log q_\beta + C \qquad (2) \\
&= \mathrm{KL}[p \,\|\, g_\alpha q] + C_{\max}\, \mathrm{KL}[q \,\|\, q_\beta] + C,
\end{aligned}$$

where $C = P \log q - C_{\max} Q \log q$, and the inequality uses $\log q_\beta < 0$ together with $\frac{p(y,x)}{q(y,x)} \le C_{\max}$.

Further,

$$\mathrm{KL}[p \,\|\, g_\alpha q] + C_{\max}\, \mathrm{KL}[q \,\|\, q_\beta]
\approx -\frac{1}{n_p} \sum_{i=1}^{n_p} \log g(y^p_i, x^p_i; \alpha)
- C_{\max} \frac{1}{n_q} \sum_{j=1}^{n_q} \log q(y^q_j | x^q_j; \beta) + C', \qquad (3)$$

where $C'$ is a constant that is irrelevant to $\alpha$ or $\beta$.

We may minimize the empirical upper-bound (3) of the KL divergence in order to obtain estimates of $\alpha$ and $\beta$. $C_{\max}$ is an unknown constant introduced in (2) that illustrates how dissimilar the two tasks are. The upper-bound (1) formalizes the common intuition that if two tasks are similar, transfer learning should be easy: the more similar the two tasks are, the smaller $C_{\max}$ is, and the tighter the bound is.

Note that minimizing (3) leads to two separate maximum likelihood estimations (MLE). The MLE of the second likelihood term of bound (3),

$$\hat{\beta} = \mathrm{argmax}_\beta\, \frac{1}{n_q} \sum_{j=1}^{n_q} \log q(y^q_j | x^q_j; \beta),$$

is a conventional MLE of a posterior model and has been well studied: $q$ can be efficiently modeled and trained using techniques such as logistic regression [3, 20]. Here we consider it already given. However, maximizing the first likelihood term, a posterior ratio,

$$\hat{\alpha} = \mathrm{argmax}_\alpha\, \frac{1}{n_p} \sum_{i=1}^{n_p} \log g(y^p_i, x^p_i; \alpha), \qquad (4)$$

is our main focus. In the next section, we show that the modelling and learning of the posterior ratio is feasible and computationally efficient.
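As a concrete illustration of the second MLE above, fitting $q(y|x; \beta)$ amounts to ordinary logistic regression on $D_Q$. A minimal sketch with scikit-learn on synthetic stand-in data (the data and all names here are assumptions, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Conventional MLE of the source posterior q(y|x; beta): plain logistic
# regression on abundant source samples (synthetic stand-ins here).
rng = np.random.default_rng(0)
yq = rng.choice([-1, 1], size=5000)                # source labels
Xq = rng.normal(loc=2.0 * yq, scale=1.0)[:, None]  # source inputs, 1-D
src = LogisticRegression(max_iter=1000).fit(Xq, yq)
q_pos = src.predict_proba(Xq)[:, 1]                # estimates of q(y=+1|x)
```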

3.2 Posterior Ratio Model

Although it is not necessary, to illustrate the idea behind posterior ratio modelling, we assume $p(y|x)$ and $q(y|x)$ belong to the exponential family; e.g. $p(y|x)$ can be parametrized as

$$p(y|x; \beta_p) \propto \exp\Big( y \sum_{i=1}^{m} \beta_{p,i} h_i(x) \Big). \qquad (5)$$

Given the parametrization (5), consider the ratio between $p$ and $q$:

$$\frac{p(y|x; \beta_p)}{q(y|x; \beta_q)} \propto \exp\Big( y \sum_{i=1}^{m} (\beta_{p,i} - \beta_{q,i}) h_i(x) \Big).$$

For every $i$ with $\beta_{p,i} - \beta_{q,i} = 0$, the feature $h_i$ is nullified and can therefore be ignored when modelling the ratio. In fact, once the ratio is considered, $\beta_p$ and $\beta_q$ do not have to be learned separately; their difference $\alpha_i = \beta_{p,i} - \beta_{q,i}$ is sufficient to describe the transition from $p$ to $q$. Thus, we write our posterior ratio model as

$$g(y, x; \alpha) = \frac{1}{N(x; \alpha)} \exp\Big( y \sum_{i \in S} \alpha_i h_i(x) \Big), \qquad (6)$$

where $S = \{ i \mid \beta_{p,i} - \beta_{q,i} \neq 0 \}$ and $N(x; \alpha)$ is the normalization term defined as

$$N(x; \alpha) = \sum_{y \in \{-1, 1\}} q(y|x) \exp\Big( y \sum_{i \in S} \alpha_i h_i(x) \Big).$$

Such normalization is due to the fact that we are minimizing the KL divergence between $p(y|x)$ and $g(y, x; \alpha)\, q(y|x)$: we need to make sure that $g(y, x; \alpha)\, q(y|x)$ is a valid conditional probability, i.e. $\forall x: \sum_y q(y|x)\, g(y, x; \alpha) = 1$.

This modelling technique gives us great flexibility, since it only concerns the effective features $\{h_i\}_{i \in S}$ rather than the entire feature set $\{h_1, h_2, \ldots, h_m\}$. In this paper, we assume the transfer should be simple; thus the potential feature set only contains simple features, such as linear ones: $h_i(x) = x_i$, $i \in S$. From now on, we simplify $y \sum_{i \in S} \alpha_i h_i(x)$ using a linear representation $\langle \alpha, f(y, x) \rangle$, where

$$f(y, x) = [\, y h_{a_1}(x), y h_{a_2}(x), \ldots, y h_{a_{m'}}(x) \,], \qquad a_1, a_2, \ldots, a_{m'} \in S.$$

However, this modelling also causes a problem: we cannot directly evaluate the output value of the model, since we do not have access to the true posterior $q(y|x)$. Therefore, we can only use samples from $D_Q$ to approximate the normalization term.

4 Estimating Posterior Ratio

Now we introduce the estimator of the class-posterior ratio $p(y|x)/q(y|x)$. Let us substitute the model (6) into the objective (4):

$$\hat{\alpha} = \mathrm{argmax}_\alpha\, \frac{1}{n_p} \sum_{i=1}^{n_p} \log g(y^p_i, x^p_i; \alpha)
= \mathrm{argmax}_\alpha\, \frac{1}{n_p} \sum_{i=1}^{n_p} \langle \alpha, f(y^p_i, x^p_i) \rangle - \frac{1}{n_p} \sum_{i=1}^{n_p} \log N(\alpha, x^p_i).$$

The normalization term needs to be evaluated in a pointwise fashion: $N(\alpha, x^p_i)$ for each $x^p_i \in D_P$. Note that if we had sufficient observations $y^q$ paired with each $x^p_i$, i.e. $\{(y^q_j, x)\}_{j=1}^{k}$ with $x = x^p_i$, such normalization could be approximated efficiently via a sample average:

$$N(\alpha, x) \approx \frac{1}{k} \sum_{j=1}^{k} \exp\big( \langle \alpha, f(y^q_j, x) \rangle \big).$$

However, in practice not many observed samples may be paired with $x^p_i$; especially when $x$ is in a continuous domain, we may not observe any paired sample at all. We may instead consider

using the neighbouring pairs $(y^q_j, x^q_j)$, where $x^q_j$ is a neighbour of $x^p_i$, to approximate $N(\alpha, x^p_i)$, which naturally leads to the idea of a k-nearest-neighbour (k-NN) estimate of this quantity (see Figure 1):

$$N(\alpha, x^p_i) \approx N_{n_q,k}(\alpha; x^p_i) = \frac{1}{k} \sum_{j \in N_{n_q}(x^p_i, k)} \exp\big( \langle \alpha, f(y^q_j, x^q_j) \rangle \big),$$

where $N_{n_q}(x^p_i, k) = \{ j \mid x^q_j \text{ is one of the } k\text{-NNs of } x^p_i \}$.

Figure 1: Approximating $N(\alpha, x)$ using nearest neighbours: the conditional expectation $\mathbb{E}_q[\exp\langle\alpha, f(y,x)\rangle \mid x = x^p_i]$ is estimated from the source pairs $(y^q_j, x^q_j)$ closest to $x^p_i$.

Now we have a computable approximation to the posterior ratio model:

$$g_{n_q}(y, x; \alpha) = \frac{\exp\langle \alpha, f(y, x) \rangle}{N_{n_q,k}(x; \alpha)}.$$

The resulting optimization is

$$\hat{\alpha} = \mathrm{argmin}_\alpha\, \ell(\alpha; D_P, D_Q), \qquad
\ell(\alpha; D_P, D_Q) = \frac{1}{n_p} \sum_{i=1}^{n_p} \log \Big[ \frac{1}{k} \sum_{j \in N_{n_q}(x^p_i, k)} \exp\big( \langle \alpha, f(y^q_j, x^q_j) \rangle \big) \Big] - \frac{1}{n_p} \sum_{i=1}^{n_p} \langle \alpha, f(y^p_i, x^p_i) \rangle, \qquad (7)$$

which is convex. Note that $\ell$ represents the negative log-likelihood.
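A minimal sketch of the k-NN normalizer $N_{n_q,k}(\alpha; x^p_i)$ above (Python with scikit-learn's NearestNeighbors; the linear feature map $f(y, x) = y \cdot [x, 1]$ from the experiments section, and all variable names, are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_normalizer(alpha, Xp, Xq, yq, k=10):
    """N_{nq,k}(alpha; x_i^p): average of exp(<alpha, f(y_j^q, x_j^q)>)
    over the k source pairs whose x_j^q are nearest to each x_i^p."""
    # indices of the k nearest source inputs for every target input
    idx = NearestNeighbors(n_neighbors=k).fit(Xq).kneighbors(Xp)[1]
    Fq = yq[:, None] * np.hstack([Xq, np.ones((len(Xq), 1))])  # f(y_j^q, x_j^q)
    return np.exp(Fq @ alpha)[idx].mean(axis=1)  # shape (n_p,)
```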

Moreover, if we assume that the changes between the two posteriors are mild, i.e. each $\alpha_i = \beta_{p,i} - \beta_{q,i}$ is small, we may use an extra $\ell_2$ regularizer to restrict the magnitude of the model parameter $\alpha$:

$$\mathrm{argmin}_\alpha\, \ell(\alpha) + \lambda \|\alpha\|^2, \qquad (8)$$

where $\lambda$ is a regularization parameter that can be chosen via likelihood cross-validation in practice. Finally, the gradient of $\ell$ is given as

$$\nabla \ell = -\frac{1}{n_p} \sum_{i=1}^{n_p} f(y^p_i, x^p_i) + \frac{1}{n_p} \sum_{i=1}^{n_p} \hat{\mathbb{E}}_{n_q}\big[ g_{n_q}(y, x; \alpha)\, f(y, x) \mid x = x^p_i \big],$$

where $\hat{\mathbb{E}}_{n_q}[Z \mid x = x^p_i]$ is the empirical k-NN estimate of a conditional expectation over $Q$:

$$\hat{\mathbb{E}}_{n_q}[Z \mid x = x^p_i] = \frac{1}{k} \sum_{j \in N_{n_q}(x^p_i, k)} Z_j.$$

The computation of this gradient is straightforward, so we can use any gradient-based method, such as quasi-Newton, to solve the unconstrained convex optimization in (8). It can be noticed that this algorithm is similar to the density ratio estimation method KLIEP [15]: both are maximum-likelihood estimators of a ratio function between two probabilities. However, the proposed method differs from [15] in terms of modelling, motivation and usage.
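Putting (7), (8) and the gradient together, the sketch below minimizes the regularized negative log-likelihood with its analytic gradient using L-BFGS from SciPy. It reuses the neighbour indices idx from the sketch above and the same linear features; this is an illustration under those assumptions, not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

def fit_posterior_ratio(Xp, yp, Xq, yq, idx, lam=0.1):
    """Minimize l(alpha) + lam * ||alpha||^2 from (8); N(alpha; x_i^p) is the
    k-NN average over the source pairs indexed by idx[i] (idx: shape (n_p, k))."""
    Fp = yp[:, None] * np.hstack([Xp, np.ones((len(Xp), 1))])  # f(y_i^p, x_i^p)
    Fq = yq[:, None] * np.hstack([Xq, np.ones((len(Xq), 1))])  # f(y_j^q, x_j^q)

    def loss_grad(alpha):
        w = np.exp(Fq @ alpha)[idx]        # exp<alpha, f> at the neighbours
        N = w.mean(axis=1)                 # k-NN normalizer per target point
        loss = -(Fp @ alpha).mean() + np.log(N).mean() + lam * alpha @ alpha
        r = w / (w.shape[1] * N[:, None])  # = g_n(y_j, x_j; alpha) / k
        # empirical conditional expectation E-hat[g_n * f | x = x_i^p], averaged over i
        grad = -Fp.mean(axis=0) + (r[:, :, None] * Fq[idx]).sum(axis=1).mean(axis=0)
        return loss, grad + 2.0 * lam * alpha

    res = minimize(loss_grad, np.zeros(Fp.shape[1]), jac=True, method="L-BFGS-B")
    return res.x
```

In practice, $\lambda$ (and $k$, as described in the appendix) would be chosen by cross-validation.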

5 Consistency of the Estimator

In this section, we analyze the consistency of the estimator given in (7), i.e. whether the estimated parameter converges to the solution of the population objective function. This result is not straightforward, since we use an extra k-NN approximation in our model, so that the model itself is an estimate. The question is: does this approximation lead to a consistent estimator? First, we define the estimated and the true parameter as

$$\hat{\alpha} = \mathrm{argmin}_\alpha\, \ell(\alpha; D_P, D_Q) = \mathrm{argmax}_\alpha\, P_{n_p} \log g_{n_q}(y, x; \alpha), \qquad
\alpha^* = \mathrm{argmax}_\alpha\, P \log g(y, x; \alpha),$$

where $P_{n_p}$ is the empirical measure of the distribution $P$.

Assumption 1 (Bounded Ratio Model). There exists $1 < M_{\max} < \infty$ such that $|\langle \alpha, f(y, x) \rangle| \le \log M_{\max}$. Moreover, $\alpha$ lies in a totally bounded metric space $\Theta$, and $\max_{y,x} \| f(y, x) \|_2 \le F_{\max}$ with $0 < F_{\max} < \infty$. Therefore $\exp\langle \alpha, f(y, x) \rangle$, $N(x; \alpha)$ and $N_{n_q,k}(x; \alpha)$ all lie in $[1/M_{\max}, M_{\max}]$, and the posterior ratio model is always bounded by constants. This is a reasonable assumption: as the posterior ratio measures the differences between two tasks, the true posterior ratio must be close to one if the two tasks are similar.

Assumption 2 (Bounded Covariate Shift). $\frac{p(x)}{q(x)} \le R_{\max} < \infty$. The supports of $P$ and $Q$ must overlap: if the samples in $D_P$ are distributed completely differently from those in $D_Q$, it does not make sense to expect a transfer learning method to work well.

Assumption 3 (Identifiability). $\alpha^*$ is the unique global maximizer of the population objective function $P \log g(y, x; \alpha)$, i.e. for all $\epsilon > 0$,

$$\sup_{\alpha:\, \|\alpha - \alpha^*\| \ge \epsilon} P \log g(y, x; \alpha) < P \log g(y, x; \alpha^*).$$

Then we have the following theorem, which states that our posterior ratio estimator is consistent.

Theorem 1. Suppose that for each $x$, the random variable $\|X - x\|$ is absolutely continuous. If $n_p \to \infty$, $n_q \to \infty$, $k_{n_q}/\log n_q \to \infty$ and $k_{n_q}/n_q \to 0$, where $k_{n_q}$ is the sample-size dependent version of $k$ (the number of nearest neighbours used in the k-NN approximation), then under the above assumptions $\hat{\alpha} \overset{p}{\to} \alpha^*$. Further, $\ell(\hat{\alpha}; D_P, D_Q) \overset{p}{\to} -\mathrm{KL}[p \,\|\, q]$.

The proof relies on the following lemma.

Lemma 1. Under all the assumptions stated above, if $n_p \to \infty$, $n_q \to \infty$, $k_{n_q}/\log n_q \to \infty$ and $k_{n_q}/n_q \to 0$, then

$$\sup_\alpha \big| P_{n_p} \log g_{n_q}(y, x; \alpha) - P \log g(y, x; \alpha) \big| \overset{p}{\to} 0,$$

i.e. the error caused by approximating the objective using samples converges to 0 in probability, uniformly with respect to $\alpha$.

One of the key steps is to decompose the above empirical approximation error of the objective function into an approximation error caused by using samples from $P$, plus a modelling error caused by the k-NN approximation using samples from $Q$. It can be observed that the bound $R_{\max}$ on the density ratio also contributes to the error. The complete proof is included in the appendix.

6 Decomposing Parameter vs. Decomposing Model

Instead of decomposing the model, $p(y|x; \alpha, \beta) = g_\alpha \cdot h_\beta$, as we propose in this paper, model-reuse methods (e.g. [6, 13]) decompose the parameter: $\beta_p = \alpha + \beta_q$, which leads to the problem of minimizing a KL divergence

$$\min_{\beta_q, \alpha}\, \mathrm{KL}[p \,\|\, h_{\alpha + \beta_q}].$$

Two issues come with this criterion. First, the problem is not identifiable, since there exist infinitely many combinations of $\alpha$ and $\beta_q$ that minimize the objective function; one must use extra assumptions. Model-reuse methods add a regularizer on the parameter $\beta_q$ using the KL divergence:

$$(\hat{\beta}_q, \hat{\alpha}) = \mathrm{argmin}_{\beta_q, \alpha}\, \mathrm{KL}[p \,\|\, h_{\alpha + \beta_q}] + \gamma\, \mathrm{KL}[q \,\|\, h_{\beta_q}], \qquad (9)$$

which implies that the minimizer $\beta_q$ should also make the difference between $q$ and $h_{\beta_q}$ small, in terms of KL divergence. Here $\gamma$ is a balancing parameter that has to be tuned using cross-validation, which may perform poorly when the number of samples from $D_P$ is low. As we will show later in the experiments, the choice of $\gamma$ is crucial to the performance when $n_p$ is small. Second, since the model must be normalized, i.e. $\int h_{\alpha + \beta_q}\, \mathrm{d}y = 1$, $\beta_q$ and $\alpha$ are always coupled, and one must always solve for them together, meaning the algorithm has to handle the complicated feature space for both $\beta_q$ and $\alpha$.

However, things are much easier if we have access to the true parameter $\beta_q$ of the posterior: then we can model the posterior of $p$ as $g(y, x; \alpha)\, q(y|x; \beta_q)$, where $g$ is the model of the ratio. This setting leads to the proposed posterior ratio learning method:

$$\alpha^* = \mathrm{argmin}_\alpha\, \mathrm{KL}\big[ p \,\|\, g(y, x; \alpha)\, q(y|x; \beta_q) \big],$$

where $\beta_q$ is a constant, so this optimization is with respect to $\alpha$ only. This paper presents an algorithm that can obtain an estimate of $g(y, x; \alpha)$ even if one does not know $q(y|x; \beta_q)$ exactly: $q(y|x; \beta_q)$ is learned separately and multiplied with $g(y, x; \alpha)$ in order to provide a posterior output. In comparison, the decomposition of the model results in two independent optimizations, and we are free from the joint objective where the choice of the parameter $\gamma$ is problematic. Neither do we have to assume that $\alpha$ and $\beta_q$ are in the same parameter space.

Figure 2: Experiments on artificial datasets. (a) $\ell(\hat{\alpha}; D_P, D_Q)$; (b) negative hold-out likelihood; (c) illustration of the Gaussian dataset shift; (d) $p(y|x; \hat{\beta}_p)$, miss-rate 13.8%; (e) $q(y|x; \hat{\beta}_q)$, miss-rate 15.2%; (f) $g(y, x; \hat{\alpha})\, q(y|x; \hat{\beta}_q)$, miss-rate 8.0%.

Figure 3: 20 News datasets. Panels: (a) sci.crypt; (b) sci.electronics; (c) sci.med; (d) sci.space; (e) talk.politics.guns; (f) talk.politics.mideast; (g) talk.politics.misc; (h) talk.religion.misc.

7 Experiments

We fix the feature function $f$ as $f(x, y) := y \cdot [x, 1]$, consistent with the simple transfer model assumption discussed above.

7.1 Synthetic Experiments

KL convergence. The first experiment uses our trained posterior ratio model to approximate the conditional KL divergence. Since $\hat{\alpha} \overset{p}{\to} \alpha^*$, we hope to see $\ell(\hat{\alpha}; D_P, D_Q) \to -\mathrm{KL}[p \,\|\, q]$ as $n_p, n_q \to \infty$. We draw two balanced classes of samples from two Gaussian distributions with different means for $P$ and $Q$. Specifically, for $y \in \{-1, 1\}$,

we construct $P$ and $Q$ as follows:

$$q(x|y=1) = \mathrm{Normal}(2, 1), \qquad q(x|y=-1) = \mathrm{Normal}(-2, 1),$$
$$p(x|y=1) = \mathrm{Normal}(1.5, 1), \qquad p(x|y=-1) = \mathrm{Normal}(-1.5, 1).$$

We draw 5k samples from distribution $Q$ and $n_p$ samples from $P$; $k$ is chosen to minimize the error of conditional mean estimation (same below), as introduced in the appendix. We then train a posterior ratio model $g(y, x; \hat{\alpha})$. By varying $n_p$ and the random sampling, we create a plot of the averaged $\ell(\hat{\alpha}; D_P, D_Q)$, with standard errors over 25 runs, in Figure 2(a). The true conditional KL divergence is plotted alongside as a blue horizontal dashed line. For comparison, we run the same estimation again with 50k samples from $Q$ and plot it in red. The results show that our estimator does converge to the true KL divergence, and the estimation error shrinks as $n_p$ grows. Increasing $n_q$ also helps slightly reduce the variance (compare the blue error bars with the red ones); however, this improvement is not as significant as that from increasing $n_p$. (A data-generation sketch for this construction is given below.)

Joint vs. Separated. In this experiment, we demonstrate the effect of introducing the balancing parameter $\gamma$ in the joint optimization method discussed in Section 6. We reuse the dataset from the previous experiment and test the averaged negative hold-out likelihood of the approach described in (9) and of the proposed method, using $D_P$ of various sizes. It can be seen that the choice of the parameter $\gamma$ has a huge effect on the hold-out likelihood when $n_p$ is small. The proposed method is free from such a parameter and can achieve a very low negative hold-out likelihood even when using only 10 samples from $D_P$.

4-Gaussian. The second experiment demonstrates how a simple transfer model helps transfer a non-linear classifier. The dataset $D_Q$ is constructed using mixtures of Gaussian distributions with different means on the horizontal axis, and the two classes of samples are not linearly separable. To create the dataset $D_P$, we simply shift their means away from each other along the vertical dimension (see Figure 2(c)).
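For reference, here is a short sketch reproducing the one-dimensional Gaussian construction from the KL-convergence experiment above; the unit class-conditional variance is an assumption, since the extraction lost the second distribution parameter:

```python
import numpy as np

def draw_task(n, mu, rng):
    """Balanced two-class 1-D samples with class-conditionals
    Normal(+mu, 1) and Normal(-mu, 1); unit variance is assumed."""
    y = rng.choice([-1, 1], size=n)
    x = rng.normal(loc=mu * y, scale=1.0)
    return x[:, None], y

rng = np.random.default_rng(0)
Xq, yq = draw_task(5000, 2.0, rng)  # source task Q: class means at +/-2
Xp, yp = draw_task(100, 1.5, rng)   # target task P: class means at +/-1.5, n_p small
```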

We compare the posterior functions learned by kernel logistic regression performed on $D_P$ (Figure 2(d)) and on $D_Q$ (Figure 2(e)) with the proposed transfer learning method (Figure 2(f)), which is the multiplication of the learned $g(y, x; \hat{\alpha})$ and $q(y|x; \hat{\beta}_q)$. We set $n_p = 40$; $n_q$ is much larger. It can be seen from Figure 2(d) that although kernel logistic regression has learned the rough decision boundary by using $D_P$ only, it has completely missed the characteristics of the posterior function near the class border due to the lack of observations. In contrast, built upon a successfully learned posterior function on dataset $D_Q$ (Figure 2(e)), the proposed method successfully transferred the posterior function to the new dataset $D_P$, even though it is equipped only with linear features (Figure 2(f)). The classification boundary it provides is highly non-linear.

7.2 Real-world Applications

20-news. Experiments are run on the 20-news dataset, where articles are grouped into major categories (such as sports) and sub-categories (such as sports.basketball). In this experiment, we adopt a one-versus-the-others scenario: the task is to predict whether an article is drawn from a certain sub-category or not. We first construct $D_P$ by randomly selecting a few samples from a certain sub-category $T$ and then mixing them with an equal number of samples from the rest of the categories. $D_Q$ is constructed using abundant random samples from the same major but different sub-categories, together with random samples from all the rest of the categories as negative samples. We apply PCA to reduce the dimension to just 20. Figure 3 summarizes the miss-classification rates of the proposed transfer learning algorithm and all the other methods, namely LogiP (logistic regression on $D_P$), LogiQ (logistic regression on $D_Q$), TrAdaBoost [4], Reg [6], CovarShift [15, 9] and Adaptive [18], over different sub-categories $T$ in the sci and talk categories. The results show that the proposed method works well in almost all cases, while the comparison methods Reg, CovarShift and TrAdaBoost sometimes have difficulty beating the naive baselines LogiP and LogiQ. In most cases, Adaptive cannot improve much over LogiP.

Figure 4: Amazon sentiment datasets. Panels: (a) kitchen; (b) dvd; (c) books.

Amazon sentiment. The final experiment is conducted on the Amazon sentiment dataset, where the task is to classify the positive or negative sentiment of users' review comments on kitchen, electronics, books and dvd products. Since some of the products (such as electronics) are far better reviewed than others (such as kitchen tools), it is natural to transfer a classifier from a well-reviewed product to another one. In this experiment, we first sample $D_P$ from one product $T$ and construct the dataset $D_Q$ using all samples from all other products. We apply locality preserving projection [8] to reduce the original dimension to 30. The classification error rate is reported in Figure 4 for $T$ = kitchen, dvd and books. We omit $T$ = electronics, since we noticed that LogiP and LogiQ have very close performance on this dataset, suggesting transfer learning is not helpful. It can be seen that the proposed method also achieves a low miss-classification rate on all three datasets, even though Adaptive gradually catches up when $n_p$ is large enough. Interestingly, Figures 4(b) and 4(c) show that LogiQ can achieve a very low error rate, and the

proposed method manages to reach similar rates. Even if the benefit of transferring is not clear in these two cases, the proposed method does not seem to bring in extra errors by also considering samples from the target dataset $D_P$, which could have been misleading.

8 Conclusions

As modern classifiers get increasingly complicated, the cost of transfer learning becomes a major concern: in many applications, the transfer should be both quick and accurate. To reduce the modeling complexity, we introduced a composite method: learn a posterior ratio and the source probabilistic classifier separately, then combine them together. As the posterior ratio allows incremental modeling, features, no matter how complicated, can be ignored as long as they do not participate in the dataset transfer. The posterior ratio is learned via an efficient convex optimization and is proved consistent. Experiments on both artificial and real-world datasets give promising results.

References

[1] D. W. K. Andrews. Generic uniform convergence. Econometric Theory, 8(2):241-257, 1992.

[2] R. Chattopadhyay, Q. Sun, W. Fan, I. Davidson, S. Panchanathan, and J. Ye. Multisource domain adaptation and its application to early detection of fatigue. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(4):18, 2012.

[3] D. R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B (Methodological), pages 215-242, 1958.

[4] W. Dai, Q. Yang, G. R. Xue, and Y. Yu. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.

[5] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.

[6] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004.

[7] L. Györfi. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2002.

[8] X. He, D. Cai, S. Yan, and H.-J. Zhang. Neighborhood preserving embedding. In Computer Vision (ICCV), Tenth IEEE International Conference on, volume 2. IEEE, 2005.

[9] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391-1445, 2009.

[10] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

[11] W. K. Newey and D. McFadden. Large sample estimation and hypothesis testing. Handbook of Econometrics, 4:2111-2245, 1994.

[12] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.

[13] R. Raina, A. Y. Ng, and D. Koller. Constructing informative priors using transfer learning. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.

[14] M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20. Curran Associates, Inc., 2008.

[15] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699-746, 2008.

[16] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, NY, USA, 1998.

[17] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Publishing Company, Incorporated, 2010.

[18] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th International Conference on Multimedia. ACM, 2007.

[19] Y. Zhang, X. Hu, and Y. Fang. Logistic regression for transductive transfer learning from multiple sources. In L. Cao, J. Zhong, and Y. Feng, editors, Advanced Data Mining and Applications, volume 6441 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2010.

[20] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. In Advances in Neural Information Processing Systems, 2001.

Appendix 1, Proof for Lemma 1

Proof. First, we decompose the supremum of the approximation error of the empirical objective function:

$$\begin{aligned}
&\sup_\alpha \big| P_{n_p} \log g_{n_q}(y, x; \alpha) - P \log g(y, x; \alpha) \big| \\
&= \sup_\alpha \big| P_{n_p} \langle \alpha, f(y,x) \rangle - P_{n_p} \log N_{n_q}(\alpha; x) - P \langle \alpha, f(y,x) \rangle + P \log N(\alpha; x) \big| \\
&\le \sup_\alpha \big| (P_{n_p} - P) \langle \alpha, f(y,x) \rangle \big| + \sup_\alpha \big| P_{n_p} \log N_{n_q}(\alpha; x) - P \log N(\alpha; x) \big| \\
&\le \sup_\alpha \big| (P_{n_p} - P) \langle \alpha, f(y,x) \rangle \big| + \sup_\alpha \big| (P_{n_p} - P) \log N_{n_q}(\alpha; x) \big| + \sup_\alpha \big| P \log N_{n_q}(\alpha; x) - P \log N(\alpha; x) \big|. \qquad (10)
\end{aligned}$$

The first two terms in (10) are due to the approximation using samples from $D_P$, while the third term is the model approximation error caused by using k-NN to approximate $N(x; \alpha)$. The first two terms are relatively easy to bound: the Uniform Law of Large Numbers (see, e.g., Lemma 2.4 in [11]) can be applied to show that they converge to 0 in probability, since (i) $\Theta$ is compact, (ii) both $\langle \alpha, f(y,x) \rangle$ and $\log N_{n_q}$ are continuous over $\Theta$, and (iii) both functions are Lipschitz continuous, as we will show later. As to the third term, we first prove that for all $\epsilon > 0$,

$$\mathrm{Prob}\Big( \sup_\alpha \big| P \log N_{n_q}(\alpha; x) - P \log N(\alpha; x) \big| > \epsilon \Big) \to 0 \qquad (11)$$

by using the following inequality: $|\log a - \log b| \le \frac{|a - b|}{\min(a, b)}$. Then

$$\begin{aligned}
\sup_\alpha \big| P \log N_{n_q}(\alpha; x) - P \log N(\alpha; x) \big|
&\le \sup_\alpha P\, \big| \log N_{n_q}(\alpha; x) - \log N(\alpha; x) \big| \\
&\le M_{\max} \sup_\alpha P \big| N_{n_q}(\alpha; x) - N(\alpha; x) \big| \\
&\le R_{\max} M_{\max} \sup_\alpha Q \big| N_{n_q}(\alpha; x) - N(\alpha; x) \big|,
\end{aligned}$$

where the second line uses $N, N_{n_q} \ge 1/M_{\max}$ and the last line uses Assumption 2. To show the final line converges to 0 in probability, we use the Generic Uniform Law of Large Numbers (Generic ULLN; see [1], Theorem 1):

Theorem 2 (Generic ULLN). For a random sequence $\{G_{n_q}(\alpha), \alpha \in \Theta, n_q \ge 1\}$, if $\Theta$ is a totally bounded metric space, $G_{n_q}$ is stochastically equicontinuous (SE) and $G_{n_q}(\alpha) \overset{p}{\to} 0$ for each $\alpha \in \Theta$, then $\sup_{\alpha \in \Theta} |G_{n_q}(\alpha)| \overset{p}{\to} 0$ as $n_q \to \infty$.

Since $\Theta$ is bounded by assumption, we now verify the other two conditions of this theorem. The universal consistency of k-NN regression has been proved (see [7], Theorem 23.8); here we restate the result for our convenience:

Theorem 3 (Universal consistency of k-NN). Given $Z$ bounded, assume that for each $x$ the random variable $\|X - x\|$ is absolutely continuous. If $k_{n_q}/\log n_q \to \infty$ and $k_{n_q}/n_q \to 0$, the $k_{n_q}$-NN estimator is strongly universally consistent, i.e.

$$\lim_{n_q \to \infty} \int \Big| \frac{1}{k} \sum_{j \in N_{n_q,k}(x)} z_j - \mathbb{E}[Z \mid X = x] \Big|^2 \mathrm{d}\mu(x) = 0$$

with probability one for all distributions of $(Z, X)$, where $\mu(x)$ is the probability measure of $x$.

From Jensen's inequality, we have

$$\bigg( \int \Big| \frac{1}{k} \sum_{j \in N_{n_q,k}(x)} z_j - \mathbb{E}[Z \mid X = x] \Big| \mathrm{d}\mu(x) \bigg)^2 \le \int \Big| \frac{1}{k} \sum_{j \in N_{n_q,k}(x)} z_j - \mathbb{E}[Z \mid X = x] \Big|^2 \mathrm{d}\mu(x),$$

so the left-hand side also converges to 0 in probability. By the Continuous Mapping Theorem, $\int \big| \frac{1}{k} \sum_{j \in N_{n_q,k}(x)} z_j - \mathbb{E}[Z \mid X = x] \big| \mathrm{d}\mu(x)$ converges to 0 in probability as well.

We let $Z_\alpha = \exp(\langle \alpha, f(y, X) \rangle)$ be a new random variable, so that we have samples $\{(z_{\alpha,j}, x_j)\}_{j=1}^{n_q}$ of $(Z_\alpha, X)$ drawn from distribution $Q$, and

$$Q \big| N_{n_q}(\alpha; x) - N(\alpha; x) \big| = \int \Big| \frac{1}{k} \sum_{j \in N_{n_q,k}(x)} z_{\alpha,j} - \mathbb{E}[Z_\alpha \mid X = x] \Big| \mathrm{d}\mu(x).$$

By applying Theorem 3, we can conclude that $Q| N_{n_q}(\alpha; x) - N(\alpha; x) |$ converges to 0 in probability for every distribution of $(Z_\alpha, X)$ indexed by the parameter $\alpha$.

Next, we verify the SE of $Q| N_{n_q}(\alpha; x) - N(\alpha; x) |$. Given Assumption 1, we have

$$\begin{aligned}
\Big| Q\big| N_{n_q}(\alpha; x) - N(\alpha; x) \big| - Q\big| N_{n_q}(\alpha'; x) - N(\alpha'; x) \big| \Big|
&\le Q\Big( \big| N_{n_q}(\alpha; x) - N_{n_q}(\alpha'; x) \big| + \big| N(\alpha; x) - N(\alpha'; x) \big| \Big) \\
&\le 2 M_{\max} F_{\max} \|\alpha - \alpha'\|_2. \qquad (12)
\end{aligned}$$

The last line is due to the Mean Value Theorem:

$$\big| \exp\langle \alpha, f(y,x) \rangle - \exp\langle \alpha', f(y,x) \rangle \big| \le \| f(y,x) \|_2 \exp\langle \bar{\alpha}, f(y,x) \rangle\, \|\alpha - \alpha'\|_2 \le F_{\max} M_{\max} \|\alpha - \alpha'\|_2,$$

where $\bar{\alpha}$ is a vector in between $\alpha$ and $\alpha'$ elementwise. In fact, (12) shows that the function $Q| N_{n_q}(\alpha; x) - N(\alpha; x) |$ is Lipschitz continuous with respect to $\alpha$, and according to Lemma 2 in [1], this implies SE. Similarly, one can show that $N_{n_q}(x; \alpha)$ is Lipschitz continuous. Now we can utilize (i) the boundedness of $\Theta$, (ii) SE and (iii) universal consistency to conclude that

$$\sup_\alpha Q \big| N_{n_q}(\alpha; x) - N(\alpha; x) \big| \overset{p}{\to} 0,$$

and therefore, by the inequality chain above,

$$\mathrm{Prob}\Big( \sup_\alpha P\big( \log N_{n_q}(\alpha; x) - \log N(\alpha; x) \big) \ge \epsilon \Big) \to 0.$$

Similarly, one can prove that

$$\mathrm{Prob}\Big( \sup_\alpha P\big( \log N(\alpha; x) - \log N_{n_q}(\alpha; x) \big) \ge \epsilon \Big) \to 0.$$

As a consequence, (11) holds and the third term in (10) converges to 0 in probability, which completes the proof.

After obtaining Lemma 1, the rest is similar to the proof of Theorem 9.13 in [17]. Let $M(\alpha) := P \log g(y, x; \alpha)$ and $M_{n_p,n_q}(\alpha) := P_{n_p} \log g_{n_q}(y, x; \alpha)$. Then

$$\begin{aligned}
M(\alpha^*) - M(\hat{\alpha}) &= M(\alpha^*) - M_{n_p,n_q}(\hat{\alpha}) + M_{n_p,n_q}(\hat{\alpha}) - M(\hat{\alpha}) \\
&\le M(\alpha^*) - M_{n_p,n_q}(\alpha^*) + M_{n_p,n_q}(\hat{\alpha}) - M(\hat{\alpha}) \\
&\le \big| M(\alpha^*) - M_{n_p,n_q}(\alpha^*) \big| + \sup_\alpha \big| M_{n_p,n_q}(\alpha) - M(\alpha) \big|,
\end{aligned}$$

where the second line uses $M_{n_p,n_q}(\hat{\alpha}) \ge M_{n_p,n_q}(\alpha^*)$, since $\hat{\alpha}$ maximizes $M_{n_p,n_q}$. That the last line converges to 0 in probability is proved in Lemma 1. Therefore, we can write: for all $\epsilon > 0$, $P\big( M(\alpha^*) - M(\hat{\alpha}) \ge \epsilon \big) \to 0$. Due to Assumption 3, for an arbitrary choice of $\epsilon_0 > 0$, if $\|\hat{\alpha} - \alpha^*\| \ge \epsilon_0$ there must be an $\epsilon_1 > 0$ such that $M(\alpha^*) - M(\hat{\alpha}) > \epsilon_1$. Therefore, we conclude that for all $\epsilon_0 > 0$,

$$P\big( \|\hat{\alpha} - \alpha^*\| \ge \epsilon_0 \big) \le P\big( M(\alpha^*) - M(\hat{\alpha}) \ge \epsilon_1 \big) \to 0.$$

Also, $M_{n_p,n_q}(\hat{\alpha}) - M(\alpha^*) = M_{n_p,n_q}(\hat{\alpha}) - M(\hat{\alpha}) + M(\hat{\alpha}) - M(\alpha^*)$, which converges to 0 in probability due to Lemma 1. Therefore, we have $\ell(\hat{\alpha}; D_P, D_Q) \overset{p}{\to} -\mathrm{KL}[p \,\|\, q]$.

Tuning Parameters in Posterior Ratio Estimation

$k$ in k-NN: As mentioned in Section 7.1, $k$ is tuned via 5-fold cross-validation, based on the testing criterion

$$\mathrm{MSE} = \frac{1}{|D_{\mathrm{HO}}|} \sum_{j \in D_{\mathrm{HO}}} \Big( Z^q_j - \frac{1}{k} \sum_{j' \in N(x^q_j)} Z^q_{j'} \Big)^2, \qquad (13)$$

where $D_{\mathrm{HO}}$ is a hold-out dataset and $Z^q_i = \exp(\langle \alpha, f(y^q_i, x^q_i) \rangle)$. However, this value depends on $\alpha$, which changes at every iteration of the gradient descent. Instead of tuning $k$ after each iteration, we follow a simple heuristic: (1) fix $k$ and run gradient descent; (2) choose a suitable $k$ that minimizes (13). Steps (1) and (2) are carried out repeatedly until convergence. This heuristic performs very well in experiments.
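A simplified single-split sketch of selection step (2) using criterion (13); in the heuristic above this alternates with gradient-descent updates of $\alpha$, and the feature map and all names are the same illustrative assumptions as before:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def choose_k(alpha, Xq_ho, yq_ho, candidates=(5, 10, 20, 40)):
    """Pick k minimizing the hold-out criterion (13): squared error of the
    k-NN estimate of Z = exp(<alpha, f(y, x)>) on a hold-out set D_HO."""
    F = yq_ho[:, None] * np.hstack([Xq_ho, np.ones((len(Xq_ho), 1))])
    Z = np.exp(F @ alpha)
    best_k, best_err = candidates[0], np.inf
    for k in candidates:
        # query the hold-out set against itself; column 0 is the point itself
        idx = NearestNeighbors(n_neighbors=k + 1).fit(Xq_ho).kneighbors(Xq_ho)[1]
        err = np.mean((Z - Z[idx[:, 1:]].mean(axis=1)) ** 2)
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```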


More information

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points.

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points. Solved Problems Solved Problems P Solve the three simle classification roblems shown in Figure P by drawing a decision boundary Find weight and bias values that result in single-neuron ercetrons with the

More information

Best approximation by linear combinations of characteristic functions of half-spaces

Best approximation by linear combinations of characteristic functions of half-spaces Best aroximation by linear combinations of characteristic functions of half-saces Paul C. Kainen Deartment of Mathematics Georgetown University Washington, D.C. 20057-1233, USA Věra Kůrková Institute of

More information

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino

More information

Covariance Matrix Estimation for Reinforcement Learning

Covariance Matrix Estimation for Reinforcement Learning Covariance Matrix Estimation for Reinforcement Learning Tomer Lancewicki Deartment of Electrical Engineering and Comuter Science University of Tennessee Knoxville, TN 37996 tlancewi@utk.edu Itamar Arel

More information

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS Proceedings of DETC 03 ASME 003 Design Engineering Technical Conferences and Comuters and Information in Engineering Conference Chicago, Illinois USA, Setember -6, 003 DETC003/DAC-48760 AN EFFICIENT ALGORITHM

More information

Partial Identification in Triangular Systems of Equations with Binary Dependent Variables

Partial Identification in Triangular Systems of Equations with Binary Dependent Variables Partial Identification in Triangular Systems of Equations with Binary Deendent Variables Azeem M. Shaikh Deartment of Economics University of Chicago amshaikh@uchicago.edu Edward J. Vytlacil Deartment

More information

Optimal Design of Truss Structures Using a Neutrosophic Number Optimization Model under an Indeterminate Environment

Optimal Design of Truss Structures Using a Neutrosophic Number Optimization Model under an Indeterminate Environment Neutrosohic Sets and Systems Vol 14 016 93 University of New Mexico Otimal Design of Truss Structures Using a Neutrosohic Number Otimization Model under an Indeterminate Environment Wenzhong Jiang & Jun

More information

Linear diophantine equations for discrete tomography

Linear diophantine equations for discrete tomography Journal of X-Ray Science and Technology 10 001 59 66 59 IOS Press Linear diohantine euations for discrete tomograhy Yangbo Ye a,gewang b and Jiehua Zhu a a Deartment of Mathematics, The University of Iowa,

More information

Sums of independent random variables

Sums of independent random variables 3 Sums of indeendent random variables This lecture collects a number of estimates for sums of indeendent random variables with values in a Banach sace E. We concentrate on sums of the form N γ nx n, where

More information

COMMUNICATION BETWEEN SHAREHOLDERS 1

COMMUNICATION BETWEEN SHAREHOLDERS 1 COMMUNICATION BTWN SHARHOLDRS 1 A B. O A : A D Lemma B.1. U to µ Z r 2 σ2 Z + σ2 X 2r ω 2 an additive constant that does not deend on a or θ, the agents ayoffs can be written as: 2r rθa ω2 + θ µ Y rcov

More information

Brownian Motion and Random Prime Factorization

Brownian Motion and Random Prime Factorization Brownian Motion and Random Prime Factorization Kendrick Tang June 4, 202 Contents Introduction 2 2 Brownian Motion 2 2. Develoing Brownian Motion.................... 2 2.. Measure Saces and Borel Sigma-Algebras.........

More information

Improved Capacity Bounds for the Binary Energy Harvesting Channel

Improved Capacity Bounds for the Binary Energy Harvesting Channel Imroved Caacity Bounds for the Binary Energy Harvesting Channel Kaya Tutuncuoglu 1, Omur Ozel 2, Aylin Yener 1, and Sennur Ulukus 2 1 Deartment of Electrical Engineering, The Pennsylvania State University,

More information

New Information Measures for the Generalized Normal Distribution

New Information Measures for the Generalized Normal Distribution Information,, 3-7; doi:.339/info3 OPEN ACCESS information ISSN 75-7 www.mdi.com/journal/information Article New Information Measures for the Generalized Normal Distribution Christos P. Kitsos * and Thomas

More information

Generalized optimal sub-pattern assignment metric

Generalized optimal sub-pattern assignment metric Generalized otimal sub-attern assignment metric Abu Sajana Rahmathullah, Ángel F García-Fernández, Lennart Svensson arxiv:6005585v7 [cssy] 2 Se 208 Abstract This aer resents the generalized otimal subattern

More information

On Isoperimetric Functions of Probability Measures Having Log-Concave Densities with Respect to the Standard Normal Law

On Isoperimetric Functions of Probability Measures Having Log-Concave Densities with Respect to the Standard Normal Law On Isoerimetric Functions of Probability Measures Having Log-Concave Densities with Resect to the Standard Normal Law Sergey G. Bobkov Abstract Isoerimetric inequalities are discussed for one-dimensional

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition TNN-2007-P-0332.R1 1 Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition Haiing Lu, K.N. Plataniotis and A.N. Venetsanooulos The Edward S. Rogers

More information

A New Asymmetric Interaction Ridge (AIR) Regression Method

A New Asymmetric Interaction Ridge (AIR) Regression Method A New Asymmetric Interaction Ridge (AIR) Regression Method by Kristofer Månsson, Ghazi Shukur, and Pär Sölander The Swedish Retail Institute, HUI Research, Stockholm, Sweden. Deartment of Economics and

More information

Estimation of Separable Representations in Psychophysical Experiments

Estimation of Separable Representations in Psychophysical Experiments Estimation of Searable Reresentations in Psychohysical Exeriments Michele Bernasconi (mbernasconi@eco.uninsubria.it) Christine Choirat (cchoirat@eco.uninsubria.it) Raffaello Seri (rseri@eco.uninsubria.it)

More information

ECE 534 Information Theory - Midterm 2

ECE 534 Information Theory - Midterm 2 ECE 534 Information Theory - Midterm Nov.4, 009. 3:30-4:45 in LH03. You will be given the full class time: 75 minutes. Use it wisely! Many of the roblems have short answers; try to find shortcuts. You

More information

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model Shadow Comuting: An Energy-Aware Fault Tolerant Comuting Model Bryan Mills, Taieb Znati, Rami Melhem Deartment of Comuter Science University of Pittsburgh (bmills, znati, melhem)@cs.itt.edu Index Terms

More information

Evaluating Process Capability Indices for some Quality Characteristics of a Manufacturing Process

Evaluating Process Capability Indices for some Quality Characteristics of a Manufacturing Process Journal of Statistical and Econometric Methods, vol., no.3, 013, 105-114 ISSN: 051-5057 (rint version), 051-5065(online) Scienress Ltd, 013 Evaluating Process aability Indices for some Quality haracteristics

More information

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models Ketan N. Patel, Igor L. Markov and John P. Hayes University of Michigan, Ann Arbor 48109-2122 {knatel,imarkov,jhayes}@eecs.umich.edu

More information

On the capacity of the general trapdoor channel with feedback

On the capacity of the general trapdoor channel with feedback On the caacity of the general tradoor channel with feedback Jui Wu and Achilleas Anastasooulos Electrical Engineering and Comuter Science Deartment University of Michigan Ann Arbor, MI, 48109-1 email:

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information