A Theoretical Framework for Deep Transfer Learning

Tomer Galanti
The School of Computer Science, Tel Aviv University
tomer22g@gmail.com

Lior Wolf
The School of Computer Science, Tel Aviv University
wolf@cs.tau.ac.il

Tamir Hazan
Faculty of Industrial Engineering & Management, Technion
tamir.hazan@gmail.com

Abstract

We generalize the notion of PAC learning to include transfer learning. In our framework, the linkage between the source and the target tasks is a result of having the sample distribution of all classes drawn from the same distribution of distributions, and by restricting all source and a target concepts to belong to the same hypothesis subclass. We have two models: an adversary model and a randomized model. In the adversary model, we show that for binary classification, conventional PAC-learning is equivalent to the new notion of PAC-transfer and to transfer generalization of the VC-dimension. For regression, we show that PAC-transferability may exist even in the absence of PAC-learning. In the randomized model, we provide PAC-Bayesian and VC-style generalization bounds to transfer learning, including bounds specifically derived for Deep Learning. A wide discussion on the tradeoffs between the different involved parameters in the bounds is provided. We demonstrate both cases in which transfer does not reduce the sample size (trivial transfer) and cases in which the sample size is reduced (non-trivial transfer).

1 Introduction

The advent of deep learning has helped promote the everyday use of transfer learning in a variety of learning problems. Representations, which are nothing more than activations of the network units at the deep layers, are used as general descriptors even though the network parameters were obtained while training a classifier on a specific set of classes under a specific sample distribution. As a result of the growing popularity of transferring deep learning representations, the need for a suitable theoretical framework has increased.
In the transfer learning setting that we consider, there are source tasks along with a target task. The source tasks are used to aid in the learning of the target task. However, the loss of the source tasks is not part of the learner's goal. As an illustrative example, consider the use of deep learning for the task of face recognition. There are 7 billion classes, each corresponding to a person, and each has its own indicator function (classifier). Moreover, the distribution of the images of each class is different. Some individuals are photographed more casually, while others are photographed in formal events. Some are photographed mainly under bright illumination, while the images of others are taken indoors. Hence, a complete discussion of transfer learning has to take into account both the classifiers and the distribution of the class samples.

A deep face-recognition neural network is trained on a small subset of the classes. For example, the DeepFace network of Taigman et al. (2014) is trained using images of only 4030 persons. The activations of the network, at the layer just below the classification layer, are then used as a generic tool to represent any face, regardless of the image distribution of that person's album images.

In this paper, we study a transferability framework, which is constructed to closely match the theory of the learnable and its extensions, including PAC learning (Valiant, 1984) and the VC dimension. A fundamental theorem of transfer learning, which links these concepts in the context of transfer learning, is provided. We introduce the notion of a simplifier that has the ability to return a subclass that is a good approximation of the original hypothesis class and is easier to learn. The conditions for the existence of a simplifier are discussed, and we show cases of transferability despite infinite VC dimensions. PAC-Bayesian and VC bounds are derived, in particular for the case of Deep Learning. A few illustrative examples demonstrate the mechanisms of transferability.

A cornerstone of our framework is the concept of a factory. Its role is to tie together the distributions of the source tasks and the target task without explicitly requiring the underlying distributions to be correlated or otherwise closely linked. The factory simply assumes that the distribution of the target task and the distributions of the source tasks are drawn i.i.d. from the same distribution of distributions. In the face recognition example above, the subset of individuals used to train the network is a random subset of the population from which the target class (another individual) is also taken. The factory provides a subset of the population and a dataset corresponding to each person. The goal of the learner is to be able to learn efficiently how to recognize a new person's face using a relatively small dataset of the new person's face images. This idea generalizes the classic notion of learning, in which the learner has access to a finite sample of examples and its goal is to be able to classify wisely a new unseen example.

Table 1: Summary of notations

$\epsilon, \delta$    error rate and confidence parameters, in $(0, 1)$
$X$    instances set
$Y$    labels set
$Z$    examples set; usually $X \times Y$
$p$    a distribution
$d$    a task (a distribution over $Z$)
$k$    the number of source tasks
$m$    the number of samples for each source task
$U$    a finite set of distributions; usually $U = \{d_1, \dots, d_k\}$ or $U = \{p_1, \dots, p_k\}$
$E$    a set of distributions over $X$
$\mathcal{E}$    an environment, a set of tasks
$\mathrm{prob}_p(X)$ or $p(X)$    the probability of a set $X$ in the distribution $p$
$\mathbb{P}, \mathbb{E}$    the probability and expectation operators
$\mathbb{P}[X \mid Y], \mathbb{E}[X \mid Y]$    the conditional probability and expectation
$D[K]$ or just $D$    a distribution over distributions (see Definitions 3, 4)
$K$    the subject of a factory
$s = \{z_1, \dots, z_m\}$    data of examples, $\forall i : z_i \in Z$
$S = (s_{[1,k]}, s_t)$    source data sets $s_1, \dots, s_k$ (of same size) and one target data set $s_t$
$o = \{x_1, \dots, x_m\}$    data of instances, $\forall i : x_i \in X$
$O = (o_{[1,k]}, o_t)$    unlabeled source data sets $o_1, \dots, o_k$ (of same size) and one target data set $o_t$
$S \sim D[k, m, n]$    data set $S$ according to the factory $D$ with sizes $\forall i \in [k] : |s_i| = m$ and $|s_t| = n$
$S \sim D[k, m]$    source data set $S$ according to the factory $D$ with sizes $\forall i \in [k] : |s_i| = m$
$U \sim D[k]$    set of tasks of size $k$ taken from $D$
$d \sim D$    a task taken from $D$
$H$    a hypothesis class (in the supervised case, a set of functions $X \to Y$)
$c$    a concept; an item of $H$
$C$    a hypothesis class family; a set of subsets of $H$ such that $H = \bigcup_{B \in C} B$
$B$    a bias; i.e., $B \in C$ (and $B \subseteq H$)
$N$    an algorithm that outputs hypothesis classes
$A$    an algorithm that outputs concepts
$r(s)$    the application of an algorithm $r$ on data $s$
$\ell : H \times Z \to \mathbb{R}$    a loss function
0-1 loss    $\ell(c, (x, y)) = \mathbb{1}[c(x) \neq y]$
squared loss    $\ell(c, (x, y)) = (c(x) - y)^2 / 2$
$T$    a learning setting; usually $T = (H, Z, \ell)$
$T_{PB}$    a PAC-Bayes setting; usually $T_{PB} = (T, Q, p)$
$\mathcal{T}$    a transfer learning setting; usually $\mathcal{T} = (T, C, \mathcal{E})$
$\epsilon_d(c)$    the generalization risk function, i.e., the expectation of $\ell(c, z)$: $\mathbb{E}_{z \sim d}[\ell(c, z)]$
$\epsilon_s(c)$    the empirical risk function; $\epsilon_s(c) = \frac{1}{|s|} \sum_{z \in s} \ell(c, z)$
$g : C \times \mathcal{E} \to \mathbb{R}$    the infimum risk; $g(B, d) = \inf_{c \in B} \epsilon_d(c) = \inf\{\epsilon_d(c) : c \in B\}$
$\epsilon_D(B)$    transfer generalization risk $= \mathbb{E}_{d \sim D}[g(B, d)]$
$\epsilon_U(B)$    source generalization risk $= \frac{1}{|U|} \sum_{d \in U} g(B, d)$
$\epsilon_s(B, r)$    2-step empirical risk $= \epsilon_s(r_B(s))$
$\epsilon_S(B, r)$    2-step source empirical risk $= \frac{1}{k} \sum_{i=1}^{k} \epsilon_{s_i}(r_B(s_i))$
$R(q)$    randomized transfer risk $= \mathbb{E}_{B \sim q}[\epsilon_D(B)]$
$R_U(q)$    randomized source generalization risk $= \mathbb{E}_{B \sim q}[\epsilon_U(B)]$
$KL(q \| p)$    KL-divergence, i.e., $KL(q \| p) = \mathbb{E}_{x \sim q}[\log(q(x)/p(x))]$
$\epsilon_p(c_1, c_2)$    the mutual error rate; $\epsilon_p(c_1, c_2) = \epsilon_{(p, c_1)}(c_2)$
$\epsilon_o(c_1, c_2)$    the mutual empirical error rate; $\epsilon_o(c_1, c_2) = \epsilon_{c_1(o)}(c_2)$
$err_p(B, K)$    the compatibility error rate; $err_p(B, K) = \sup_{c_1 \in K} \inf_{c_2 \in B} \epsilon_p(c_1, c_2)$
$err_o(B, K)$    the empirical compatibility error rate; $err_o(B, K) = \sup_{c_1 \in K} \inf_{c_2 \in B} \epsilon_o(c_1, c_2)$

Table 2: Summary of notations (continued)

$E_U(B, K)$    the source compatibility error rate; $E_U(B, K) = \frac{1}{|U|} \sum_{p \in U} err_p(B, K)$
$E(B, K)$    the generalization compatibility error rate; $E(B, K) = \mathbb{E}_{p \sim D}[err_p(B, K)]$
$E_O(B, K)$    the source empirical compatibility error rate; $E_O(B, K) = \frac{1}{|O|} \sum_{o \in O} err_o(B, K)$
$h_{V,E,\sigma,w}$    a neural network with architecture $(V, E, \sigma)$ and weights $w : E \to \mathbb{R}$
$H_{V,E,\sigma}$    set of all neural networks with architecture $(V, E, \sigma)$
$H^I_{V,E,\sigma}$    family of all subsets of $H_{V,E,\sigma}$ determined by fixing weights on $I \subseteq E$
$E = I \cup J$    a set of edges in a neural network; $I$ is the set of edges in the transfer architecture and $J$ the rest of the edges (i.e., $I \cap J = \emptyset$)
$H_{V,E,j,\sigma}$    the architecture induced by $(V, E, \sigma)$ when taking only the first $j$ layers (see Section 6)
$ERM_B(s)$    empirical risk minimizer; $ERM_B(s) = \arg\min_{c \in B} \epsilon_s(c)$
$C\text{-}ERM_C(s_{[1,k]})$    class empirical risk minimizer; $C\text{-}ERM_C(s_{[1,k]}) = \arg\min_{B \in C} \frac{1}{k} \sum_{i=1}^{k} \min_{c \in B} \epsilon_{s_i}(c)$
$c_{i,B}$    empirical risk minimizer in $B$ for the $i$-th data set; $c_{i,B} = ERM_B(s_i)$
$r_{i,B}$    the application of a learner $r_B$ of $B$ on $s_i$; $r_{i,B} = r_B(s_i)$
$u \| v$    concatenation of the vectors $u, v$
$0_s$    a zeros vector of length $s$
$1$    a unit matrix
$N_u(\epsilon, \delta)$    a universal bound on the sample complexity for learning any hypothesis class of VC dimension $u$
$E_h$    the set of all disks around 0 that lie on the hyperplane $h$
$vc(H)$    the VC dimension of the hypothesis class $H$
$\tau_H(m)$    the growth function of the hypothesis class $H$; i.e., $\tau_H(m) = \max_{\{x_1, \dots, x_m\} \subseteq X} |\{(c(x_1), \dots, c(x_m)) : c \in H\}|$
$\tau(k, m, r)$    the transfer growth function of the hypothesis class $H$; i.e., $\tau(k, m, r) = \max_{\{s_1, \dots, s_k\} \subseteq Z^m} |\{(r_{1,B}(s_1), \dots, r_{k,B}(s_k)) : B \in C\}|$
$\tau(k, m; C, K)$    the adversary transfer growth function; i.e., $\tau(k, m; C, K) = \max_{\{o_1, \dots, o_k\} \subseteq X^m} |\{(c_{1,1}(o_1), c_{1,2}(o_1), \dots, c_{k,1}(o_k), c_{k,2}(o_k)) : c_{i,1} \in K \text{ and } c_{i,2} = ERM_B(c_{i,1}(o_i)) \text{ s.t. } B \in C\}|$

2 Background

In this part, a brief introduction of the required background is provided. The general learning framework, the PAC-Bayesian setting and deep learning are introduced. These subjects are used and extended in this work. A reader who is familiar with these concepts may skip to the next sections.

The general learning setting

Recall the general learning setting proposed by Vapnik (1995). This setting generalizes classification, regression, multiclass classification, and several other learning settings.

Definition 1. A learning setting $T = (H, Z, \ell)$ is specified by a hypothesis class $H$, an examples set $Z$ (with a sigma-algebra), and a loss function $\ell : H \times Z \to \mathbb{R}$.

This approach helps to define supervised learning settings such as binary classification and regression in a formal and very clean way. Furthermore, in this framework, one can define learning scenarios in which the concepts are not functions of examples, but still have relations with examples from $Z$ measured by loss functions (e.g., clustering, density estimation, etc.). If nothing else is mentioned, $T$ stands for a learning setting. We say that $T$ is learnable if the corresponding $H$ is learnable. In addition, if $H$ has VC dimension $d$, we say that $T$ also has VC dimension $d$. With these notions, we present an extended transfer learning setting, as a special case of the general learning setting with a few changes.

If a distribution $d$ over $Z$ is specified, the fitting of each $c \in H$ is measured by a Generalization Risk,

$\epsilon_d(c) = \mathbb{E}_{z \sim d}[\ell(c, z)]$

Here, $H$, $Z$ and $\ell$ are known to the learner. The distribution $d$ is called a task and is kept unknown. The goal of the learner is to pick $c \in H$ whose risk is closest to $\inf_{c \in H} \epsilon_d(c)$. Since the distribution is unknown, this cannot be computed directly and can only be approximated using an empirical data set $s = \{z_1, \dots, z_m\}$ selected i.i.d. according to $d$. In many machine learning algorithms, the empirical risk function,

$\epsilon_s(c) = \frac{1}{m} \sum_{z \in s} \ell(c, z)$

has great impact on the selection of the output hypothesis.
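As a small illustration of the empirical risk defined above, the following Python sketch evaluates $\epsilon_s(c)$ under the 0-1 loss and the squared loss, and picks an empirical risk minimizer from a finite list of candidate concepts. The function names and the toy data are assumptions made for illustration only; the paper itself does not prescribe an implementation.

```python
def zero_one_loss(c, z):
    """0-1 loss: 1 if the concept c mislabels the example z = (x, y), else 0."""
    x, y = z
    return 0.0 if c(x) == y else 1.0


def squared_loss(c, z):
    """Squared loss (c(x) - y)^2 / 2, as in the notation table."""
    x, y = z
    return (c(x) - y) ** 2 / 2


def empirical_risk(c, s, loss):
    """Average loss of the concept c over the data set s (a list of examples)."""
    return sum(loss(c, z) for z in s) / len(s)


def erm(hypotheses, s, loss):
    """Empirical risk minimizer over a finite list of hypotheses (illustrative)."""
    return min(hypotheses, key=lambda c: empirical_risk(c, s, loss))
```

For instance, a threshold concept that labels $x \ge 0.5$ as 1 makes one mistake on the four examples `[(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 1)]`, so its empirical 0-1 risk is 0.25.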
Binary classification: $Z = X \times \{0, 1\}$ and $H$ consisting of $c : X \to \{0, 1\}$, with $\ell$ a 0-1 loss.

Regression: $Z = X \times Y$ where $X$ and $Y$ are bounded subsets of $\mathbb{R}^n$ and $\mathbb{R}$ respectively. $H$ is a set of bounded functions $c : X \to \mathbb{R}$ and $\ell$ is any bounded function.

One of the early breakthroughs in statistical learning theory was the seminal work of Vapnik & Chervonenkis (1971) and the later work of Blumer et al. (1989), which characterized binary classification settings as learnable if and only if the VC dimension is finite. The VC dimension is the largest size $m$ for which there is a set of $m$ examples such that any configuration of labels on it is consistent with one of the functions in $H$. Their analysis was based on the growth function,

$\tau_H(m) = \max_{o \subseteq X,\ |o| = m} |\{(c(x_1), \dots, c(x_m)) : c \in H\}|$, where $o = \{x_1, \dots, x_m\}$

A famous lemma due to Sauer (1972) asserts that whenever the VC dimension of the hypothesis class $H$ is finite, the growth function is polynomial in $m$,

$\tau_H(m) \le \left( \frac{e m}{vc(H)} \right)^{vc(H)}$ when $m > vc(H)$    (1)

Theorem 1 (Vapnik & Chervonenkis (1971)). Let $d$ be any distribution over an examples set $Z$, $H$ a hypothesis class and $\ell : H \times Z \to \{0, 1\}$ be the 0-1 loss function. Then

$\mathbb{E}_{s \sim d^m} \left[ \sup_{c \in H} |\epsilon_d(c) - \epsilon_s(c)| \right] \le \frac{4 + \sqrt{\log(\tau_H(2m))}}{\sqrt{2m}}$

In particular, whenever the growth function is polynomial, the generalization risk and the empirical risk uniformly converge to each other.
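To make the growth function and Sauer's bound concrete, here is a minimal Python sketch that counts labelings by brute force over a small finite domain and evaluates the right-hand sides of Equation (1) and Theorem 1. The finite domain, the threshold class in the usage note and all function names are illustrative assumptions, not part of the original analysis.

```python
import math
from itertools import combinations


def growth_function(hypotheses, domain, m):
    """Brute-force tau_H(m): the largest number of distinct labelings that a
    finite list of hypotheses induces on any m points of a finite domain."""
    best = 0
    for points in combinations(domain, m):
        labelings = {tuple(c(x) for x in points) for c in hypotheses}
        best = max(best, len(labelings))
    return best


def sauer_bound(m, vc):
    """Right-hand side of Equation (1): (e * m / vc)^vc, valid for m > vc."""
    return (math.e * m / vc) ** vc


def vc_uniform_bound(m, vc):
    """Right-hand side of Theorem 1, with tau_H(2m) replaced by its Sauer
    upper bound (this substitution needs 2m > vc)."""
    log_tau = vc * math.log(math.e * 2 * m / vc)
    return (4 + math.sqrt(log_tau)) / math.sqrt(2 * m)
```

With one-dimensional thresholds (VC dimension 1), any three distinct points admit exactly four labelings, which is below the Sauer bound $3e$, and `vc_uniform_bound` shrinks as `m` grows.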

PAC-Bayes setting

The PAC-Bayesian bound due to McAllester (1998) describes the Expected Generalization Risk (or simply expected risk), i.e., the expectation of the generalization risk with respect to a distribution over the hypothesis class. The aim is not to measure the fitting of each hypothesis directly, but to measure the fitting of different distributions (perturbations) over the hypothesis class. The expected risk is measured by $\mathbb{E}_{c \sim q}[\epsilon_d(c)]$ and the Expected Empirical Risk is $\mathbb{E}_{c \sim q}[\epsilon_s(c)]$, where $s = \{z_1, \dots, z_m\}$ (satisfying $\mathbb{E}_{s \sim d^m} \mathbb{E}_{c \sim q}[\epsilon_s(c)] = \mathbb{E}_{c \sim q}[\epsilon_d(c)]$). The PAC-Bayes bound estimates the expected risk with the expected empirical risk and a penalty term which decreases as the size of the training data set grows. A prior distribution $p$ dictating a hierarchy between the hypotheses in $H$ is selected. The PAC-Bayesian bound penalizes the posterior selection of $q$ by the relative entropy between $q$ and $p$, measured by the Kullback-Leibler divergence.

Definition 2 (PAC-Bayes setting). A PAC-Bayes setting $T_{PB} = (T, Q, p)$ is specified by a learning setting $T = (H, Z, \ell)$, a set $Q$ of posterior distributions $q$ over $H$, and a prior distribution $p$ over $H$. The loss $\ell$ is bounded in $[0, 1]$.

There are many variations of the PAC-Bayesian bound, each of which has its own properties and advantages. In this work we refer to the original bound due to McAllester (1998).

Theorem 2 (McAllester (1998)). Let $d$ be any distribution over an example set $Z$, $H$ a hypothesis class and $\ell : H \times Z \to [0, 1]$ be a loss function. Let $p$ be a distribution over $H$ and $Q$ a family of distributions over $H$. Let $\delta \in (0, 1)$, then

$\mathbb{P}_{s \sim d^m} \left[ \forall q \in Q : \mathbb{E}_{c \sim q}[\epsilon_d(c)] \le \mathbb{E}_{c \sim q}[\epsilon_s(c)] + \sqrt{\frac{KL(q \| p) + \log(m/\delta)}{2(m - 1)}} \right] \ge 1 - \delta$

where $KL(q \| p) = \mathbb{E}_{c \sim q}[\log(q(c)/p(c))]$.

Deep learning

A neural network architecture $(V, E, \sigma)$ is determined by a set of neurons $V$, a set of directed edges $E$ and an activation function $\sigma : \mathbb{R} \to \mathbb{R}$. In addition, a neural network of a certain architecture is specified by a weight function $w : E \to \mathbb{R}$.
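Returning briefly to Theorem 2, its right-hand side is easy to evaluate numerically when the hypothesis class is finite. The Python sketch below assumes the prior $p$ and the posterior $q$ are given as probability vectors over the same finite class, with $p$ positive wherever $q$ is; the function names are hypothetical.

```python
import math


def kl_divergence(q, p):
    """KL(q || p) for two probability vectors over a finite hypothesis class.
    Assumes p[i] > 0 wherever q[i] > 0; terms with q[i] == 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def mcallester_bound(expected_empirical_risk, q, p, m, delta):
    """Right-hand side of Theorem 2: the expected empirical risk plus the
    penalty sqrt((KL(q || p) + log(m / delta)) / (2 (m - 1)))."""
    penalty = math.sqrt(
        (kl_divergence(q, p) + math.log(m / delta)) / (2 * (m - 1))
    )
    return expected_empirical_risk + penalty
```

When the posterior equals the prior the KL term vanishes, and the remaining penalty depends only on $m$ and $\delta$; it shrinks as the sample size $m$ grows.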
We denote by $H_{V,E,\sigma}$ the hypothesis class consisting of all neural networks with architecture $(V, E, \sigma)$. In this work we will only consider feedforward neural networks, i.e., those with no directed cycles. In such networks, the neurons are organized in disjoint layers, $V_0, \dots, V_N$, such that $V = \bigcup_{i=0}^{N} V_i$. These networks have an output layer $V_N$ consisting of only one neuron and an input layer $V_0$ holding the input and one constant neuron that always holds the value 1. The other layers are called hidden. A fully connected neural network is a neural network in which every neuron of layer $V_i$ is connected to every neuron of layer $V_{i+1}$. The computation done in feedforward neural networks is as follows: each neuron takes the outputs $(x_1, \dots, x_h)$ of the neurons connected to it from the previous layer and the weights on the edges connecting between them $(w_1, \dots, w_h)$ and outputs $\sigma\left(\sum_{i=1}^{h} w_i x_i\right)$, see Figure 1. The output of the entire network is the value produced by the output neuron, see Figure 2.

In this paper we give special attention to the sign activation function, which returns $-1$ if the input is negative and $1$ otherwise. The reason is that such neural networks are very expressive and are easier to analyse. Such networks define compound functions of half-spaces.

Before we move on to sections dealing with general purpose transferability and the special case of deep learning, we would like to give some insights on our interpretation of common knowledge within neural networks. The classic approach to transfer learning in deep learning is done by shared weights. Concretely, some weights are shared between neural networks of similar architectures, each solving a different task. We adopt the following notation,

$H^I_{V,E,\sigma} = \{B_u \mid u : I \to \mathbb{R}\}$, s.t. $B_u = \{h_{V,E,\sigma,w} \mid \forall e \in I : w(e) = u(e)\}$ and $I \subseteq E$

to denote a family of subclasses of the hypothesis class $H_{V,E,\sigma}$, each determined by a fixing of the weights on the edges in $I \subseteq E$. We will also denote by $J$ the complement (i.e., $I \cup J = E$ and $I \cap J = \emptyset$).
[Figure 1: A neuron: four input values $x_1, x_2, x_3, 1$; weights $w_1, w_2, w_3, w_4$; and activation function $\sigma$.]

[Figure 2: A neural network: feedforward fully connected neural network with four input neurons and two hidden layers, each containing five neurons.]

This will be a cornerstone in formulating shared parameters between neural networks in transfer learning. For each $B_u$, every two neural networks $h_1, h_2 \in B_u$ share the same weights $u$ on the edges in $I \subseteq E$, see Figure 3.

In most practical cases the activation of a neuron is determined by activations from the previous layers through a set of edges that are either in $I$ or do not intersect $I$. However, in this paper, for the PAC-Bayes setting of deep learning, the discussion is kept more general, and activations can be determined by both transferred weights and non-transferred weights. For VC-type bounds, the discussion is limited to the common situation in which the architecture is decomposed into two parts: the transfer architecture and the specific architecture, i.e.,

$h_{V,E,\sigma,u \| v} = h_2 \circ h_1$

where $h_1$ is a neural network consisting of the first $j$ layers and the edges between them (with potentially more than one output), and $h_2$ has $h_1$'s output as input and produces the one output of the whole network. With the previous notions, this tends to be the case where $I$ consists of all edges between the first $j$ layers, see Figure 4. In this case, the family of hypothesis classes $H^I_{V,E,\sigma}$ is viewed as a hypothesis class $H_t$ (transfer architecture) consisting of all transfer networks with the induced architecture. This hypothesis class consists of multiclass hypotheses with instance space $X = \mathbb{R}^{|V_0|}$ and output space $Y = \{-1, 1\}^{|V_j|}$. $H_u$ serves as the specific architecture.
Their composition consists of the neural networks in $H_{V,E,\sigma}$,

$H_u \circ H_t = \{h_2 \circ h_1 \mid h_2 \in H_u, h_1 \in H_t\} = H_{V,E,\sigma}$

Each hypothesis class $B \in H^I_{V,E,\sigma}$ is now treated as a neural network $h_B$ with $M := |V_j|$ outputs, and we denote by $h_B(\cdot)$ its output.
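The decomposition $h_2 \circ h_1$ into a transfer network and a specific network can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: layers are fully connected, every layer receives an extra constant input 1 (the text only places the constant neuron in the input layer), and the helper names `make_transfer_network`, `make_specific_network` and `compose` are hypothetical.

```python
def sign(a):
    """Sign activation as described in the text: -1 if negative, 1 otherwise."""
    return -1 if a < 0 else 1


def layer(weights, inputs):
    """One fully connected layer. Each row of `weights` holds the incoming
    weights of one neuron; the last entry multiplies a constant input 1."""
    return [sign(sum(w * x for w, x in zip(row, inputs + [1]))) for row in weights]


def make_transfer_network(u):
    """h_1: the first layers, with transferred weights u (a list of matrices)."""
    def h1(x):
        out = list(x)
        for weight_matrix in u:
            out = layer(weight_matrix, out)
        return out
    return h1


def make_specific_network(v):
    """h_2: a single output neuron with task-specific weights v."""
    def h2(z):
        return layer([v], list(z))[0]
    return h2


def compose(h2, h1):
    """The whole network h_2 o h_1."""
    return lambda x: h2(h1(x))
```

Two networks built from the same `u` but different `v` belong to the same bias $B_u$: they share the representation computed by `h1` and differ only in the specific part.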

[Figure 3: A visualization of $H^I_{V,E,\sigma}$: $I$ is the set of all the blue edges. Red edges are not transferred. Each bias $B_u \in H^I_{V,E,\sigma}$ is determined by a fixed vector $u$ consisting of the weights on the edges in $I$. Note that some activations are fed by both blue and red edges.]

[Figure 4: A decomposition into transfer and specific networks: the blue edges constitute the transfer network and the red ones the specific network.]

3 Problem setup

In Section 1 we introduced transfer learning as a multitask learning scenario with source tasks and a target task. The learner is provided with data sets from similar (yet different) tasks and the goal is to come up with useful knowledge about the commonality between the tasks. That way, learning the target task would be easier, i.e., it would require smaller data sets.

In transfer learning, there are underlying and transfer problems. The underlying learning problem is the setting of each different learning problem. The transfer problem defines what is transferred during the learning process. We follow the formalism of Baxter (2000) with some modifications. In our study, the underlying setting will be most of the time a realizable binary classification/regression setting with an instance set $Z = X \times Y$. The transfer setting $\mathcal{T} = (T, C, \mathcal{E})$ is specified by:

A hypothesis class family $C$, which is a set of subsets of $H$. With no loss of generality, we will assume that $H = \bigcup_{B \in C} B$.
An environment $\mathcal{E}$, which is a set of tasks $d$.
An objective function $g(B, d) = \epsilon_d(B) := \inf_{c \in B} \epsilon_d(c)$. Typically, $B \in C$.

The transfer learner has access to tasks $\{d_i\}_{i=1}^{k} \cup \{d_t\}$ (source and target) from $\mathcal{E}$. One approach to transfer learning is to come up with $B \in C$ that fits these tasks well. The class $B$ is called a bias. Learning a target task $d_t$ might require fewer training examples when $B$ is learned successfully.
In traditional machine learning, data points are sampled i.i.d. according to a fixed distribution. In transfer learning, samples are generated by what we call a Factory. A factory is a process that provides multiple tasks. We suggest two major types of factories, Adversary Factories and Randomized Factories. The first generates supervised tasks (i.e., distributions over $Z = X \times Y$). It selects concepts $\{c_i\}_{i=1}^{k} \cup \{c_t\}$ almost arbitrarily, along with distributions over $X$. The other selects the tasks randomly i.i.d. from a distribution. In Section 4, we make use of adversary factories, while in Section 5 and Section 6 we use randomized factories instead. In both cases, Figure 5 demonstrates the process done by the factory in order to sample training data sets. The difference between the two types arises from the method used to select the tasks.

[Figure 5: A factory: the sampling of samples $z^i_j$ from tasks $\{d_i\}_{i=1}^{k} \cup \{d_t\}$. First, the tasks are selected either arbitrarily or randomly (depending on the factory type). The sample sets $s_i = \{z^i_j\}$ are then drawn from the corresponding tasks.]

3.1 The adversary factory

A factory selects source tasks and a target task that the learner is tested on. An adversary factory is a type of factory that selects supervised tasks (i.e., distributions over $Z = X \times Y$). It selects source concepts $\{c_i\}_{i=1}^{k}$ that differ from the target concept $c_t$ and are otherwise chosen arbitrarily. The factory also samples i.i.d. distributions over $X$, $\{p_i\}_{i=1}^{k} \cup \{p_t\}$, from the distribution of distributions $D$. By the supervised behaviour of the learning setting, we have $\mathcal{E} = H \times E$, where $E$ is a set of distributions over $X$.

Definition 3 (Adversary factory). A factory $D[k, m, n]$ is a process with parameters $[k, m, n]$ that:

Step 1. Selects $k + 1$ tasks $d_1, \dots, d_k, d_t$ such that $d_i = (p_i, c_i) \in \mathcal{E}$ in the following manner:
Samples i.i.d. $k + 1$ distributions $p_1, p_2, \dots, p_k, p_t$ from a distribution of distributions $D$.
Selects any $k$ source concepts $c_1, c_2, \dots, c_k$ and one target concept $c_t$ out of $H$ such that $\forall i \in [k] : c_i \neq c_t$.
Step 2. Returns $S = (s_{[1,k]}, s_t)$ such that $s_i \sim d_i^m$ and $s_t \sim d_t^n$.

Notation-wise, if $n = 0$, we will write $D[k, m]$ instead of $D[k, m, 0]$. When $m = n = 0$, we will simply write $D[k]$.
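The two steps of Definition 3 can be mimicked by a short Python sketch. Everything concrete here is an assumption made for illustration: instances are real numbers, the distribution of distributions is modeled by a hypothetical helper that draws a zero-mean Gaussian with a random scale, the adversary's choice of concepts is simply passed in as arguments, and "differ from the target concept" is checked by object identity rather than by functional equality.

```python
import random


def gaussian_with_random_scale(rng):
    """Hypothetical distribution of distributions: returns a sampler for a
    zero-mean Gaussian whose standard deviation is uniform in [1, 5]."""
    sigma = rng.uniform(1.0, 5.0)
    return lambda r: r.gauss(0.0, sigma)


def adversary_factory(draw_distribution, source_concepts, target_concept, m, n, rng):
    """Sketch of D[k, m, n]: the concepts are chosen by the caller (the
    adversary), the k + 1 instance distributions are drawn i.i.d."""
    if any(c is target_concept for c in source_concepts):
        raise ValueError("source concepts must differ from the target concept")
    k = len(source_concepts)
    # Step 1: sample k + 1 distributions over X; the last one is p_t.
    distributions = [draw_distribution(rng) for _ in range(k + 1)]

    def labeled_sample(p, c, size):
        xs = [p(rng) for _ in range(size)]
        return [(x, c(x)) for x in xs]

    # Step 2: draw m labeled examples per source task and n for the target.
    sources = [labeled_sample(p, c, m) for p, c in zip(distributions, source_concepts)]
    target = labeled_sample(distributions[k], target_concept, n)
    return sources, target
```

Calling the function with `n = 0` returns an empty target sample, which mirrors the shorthand $D[k, m]$.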
To avoid symbol overload, similar notations will be used to denote randomized factories, depending on the section. For a data set $S$ sampled according to an adversary factory, we denote by $O = (o_1, \dots, o_k, o_t)$ the original data set without the labels. This data set is a sample according to $o_i \sim p_i^m$ and $o_t \sim p_t^n$, where $p_1, \dots, p_k, p_t \sim D^{k+1}$. In Section 4, all factories are adversary, while in the following sections they are randomized.

A K-factory is a factory that selects all concepts from $K \in C$. $K$ is said to be the subject of the factory. In this paper, we assume that all adversary factories are K-factories for some unknown bias $K \in C$. The symbol $K$ will be reserved to denote the subject of a factory.

We will often write, in Section 4 and the associated proofs, $S \sim D[k, m, n]$. This is a slight abuse of notation since the concepts are not sampled. It means that the claim is true for any selection of the concepts. In some sense, we can assume that there is an underlying unknown arbitrary selection of concepts, and the data is sampled with respect to them. To avoid overload of notations, we will write $O \sim D[k, m, n]$ to denote the corresponding unlabeled version of $S$.

The requirement that $c_t$ differs from $c_i$ for all $i \in [k]$ is essential to this first model. Intuitively, the interesting cases are those in which the target concept was not encountered in the source tasks.

Formally, it is easy to handle transfer learning using any learning algorithm, by just ignoring the source data. In the other direction, if we allow repeated use of the target concept, then any transfer algorithm can be used for conventional learning by repeatedly using the target data as the source. Thus, without the requirement, one cannot get meaningful transfer learning statements for adversary factories.

The knowledge that a set of tasks was selected from the same $K \in C$, which is the subject of a K-factory, is the main source of knowledge made available during transfer learning. The second type of information arising from transfer learning is that all $k + 1$ distributions were sampled from $D$. Using the face recognition example, there is the set of visual concepts $B_{i,1}$ that captures the appearances of different furniture, and there is the set of visual concepts $B_{i,2}$ that captures the characteristics of individual grasshoppers. From the source tasks, we infer that the target concept $c_t$ belongs to a class of visual representations $B_{i,3}$ that contains image classifiers that are appropriate for modeling individual human faces.

For concreteness, we present our running example. A disk in $\mathbb{R}^3$ around 0 is a binary classifier that has radius $r$ and a hyperplane $h$ and is defined as follows:

$f_{r,h}(x) = 1$ if $x \in h \cap B(r)$, and $f_{r,h}(x) = 0$ otherwise.

Here, $B(r)$ is the ball of radius $r$ around 0 in $\mathbb{R}^3$: $B(r) = \{x \in \mathbb{R}^3 : \|x\| \le r\}$. We define $E_h = \{f_{r,h} : r \ge 0\}$, where $C = \{E_h : h \text{ hyperplane in } \mathbb{R}^3 \text{ around } 0\}$. The following example demonstrates a specific K-factory on the hypothesis class defined above.

Example 1. We consider a K-factory as follows: 1. the disks are selected arbitrarily from the same $K$ such that the source concepts differ from the target concept; 2. $D$ is supported on 3-D multivariate Gaussians with mean $\mu = 0$ and covariance $C = \sigma^2 I$, where $\sigma$ is sampled uniformly in $[1, 5]$.

Compatibility between biases

In the adversary model, the source and target concepts are selected arbitrarily from a subject bias $K \in C$. Therefore, we expect the ability of a bias $B$ to approximate $K$ to be the worst-case approximation of a concept from $K$ using a concept from $B$. Next, we formalize this relation, which we call compatibility. We start with the mutual error rate, which is measured by $\epsilon_p(c_1, c_2) := \epsilon_d(c_1)$, where $d = (p, c_2)$. A bias $B$ would be highly compatible with the factory's subject $K$ if for every selection of a target concept from $K$, there is a good candidate in $B$. The compatibility of a bias $B$ with respect to the subject $K$, given an underlying distribution $p$ over the instances set $X$, is therefore defined as follows:

Compatibility Error Rate $= err_p(B, K) := \sup_{c_2 \in K} \inf_{c_1 \in B} \epsilon_p(c_1, c_2)$

In the adversary model, the distributions for the tasks are drawn from the same distribution of distributions. We, therefore, measure the generalization risk in the following manner:

Generalization Compatibility Error Rate $= E(B, K) = \mathbb{E}_{p \sim D}[err_p(B, K)]$

The empirical counterparts of these definitions are given, for an unlabeled data set $o$ drawn from the distribution $p$, as:

Empirical Compatibility Error Rate $= err_o(B, K) = \sup_{c_2 \in K} \inf_{c_1 \in B} \epsilon_o(c_1, c_2)$

where

Empirical Mutual Error Rate $= \epsilon_o(c_1, c_2) := \frac{1}{|o|} \sum_{x \in o} \ell(c_1, (x, c_2(x)))$

In order to estimate $E(B, K)$, the average of multiple compatibility error rates is used. A set of unlabeled data sets $O = (o_1, \dots, o_k)$ is introduced, each corresponding to a different source task, and the empirical compatibility error corresponding to the source data is measured by:

Source Empirical Compatibility Error Rate $= E_O(B, K) = \frac{1}{k} \sum_{i=1}^{k} err_{o_i}(B, K)$
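For finite biases, the empirical quantities above reduce to a max-min over pairs of concepts. The Python sketch below assumes the 0-1 loss and represents $B$ and $K$ as finite lists of functions, so that the supremum and infimum become `max` and `min`; these choices and the function names are illustrative only.

```python
def empirical_mutual_error(c1, c2, o):
    """Fraction of the unlabeled instances in o on which c1 disagrees with
    the labels produced by c2 (0-1 loss)."""
    return sum(1 for x in o if c1(x) != c2(x)) / len(o)


def empirical_compatibility_error(B, K, o):
    """err_o(B, K): worst case over concepts c2 in K of the best
    approximation by a concept c1 in B, on the data set o."""
    return max(min(empirical_mutual_error(c1, c2, o) for c1 in B) for c2 in K)


def source_empirical_compatibility_error(B, K, O):
    """E_O(B, K): average of err_o(B, K) over the unlabeled data sets in O."""
    return sum(empirical_compatibility_error(B, K, o) for o in O) / len(O)
```

A bias that contains the subject $K$ itself always has empirical compatibility error 0, since every concept in $K$ is then approximated by itself.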

3.2 The randomized factory

This randomized factory was presented, as matrix sampling, by Baxter (2000). In that learning-to-learn work, transfer learning is not considered, and we modify the formulation to include the target task $d_t$.

Definition 4 (Randomized factory). A randomized factory (or simply, factory when the context is clear) is a process $D[k, m, n]$ that:

Step 1. Samples i.i.d. $k + 1$ tasks $d_1, d_2, \dots, d_k, d_t \in \mathcal{E}$ from a distribution $D$.
Step 2. Returns $S = (s_{[1,k]}, s_t)$ such that $s_i \sim d_i^m$ and $s_t \sim d_t^n$.

The probabilistic nature of the randomized factories allows them to fit a bias $B$ by minimizing a suitable risk function. A natural choice of such a function is to measure the expected loss of $B$ in approximating a task $d$ with the following quantity, which we call the Transfer Generalization Risk:

$\epsilon_D(B) := \mathbb{E}_{d \sim D}[\epsilon_d(B)]$

A reasonable approach to transfer learning is to first learn a bias $B$ (in $C$) that has a small transfer generalization risk. Since we typically have limited access to samples from the target task, we often employ the Source Generalization Risk instead:

$\epsilon_U(B) := \frac{1}{k} \sum_{i=1}^{k} \epsilon_{d_i}(B)$, where $U = \{d_1, \dots, d_k\} \sim D[k]$

The definition of an adversary factory assumes that the concepts are selected arbitrarily, but without repetitions, from the same unknown hypothesis class $K \in C$. The randomized factory that samples the concepts according to some distribution, with the restriction of 0 probability for sampling the same concept twice, could be considered a special case. Our randomized factory results do not assume this 0 probability criterion. Nevertheless, this is the usual situation and the one that is of the most interest.

3.3 Transferability

In this section, we provide general definitions of transfer learning. We follow the classical learning theory: defining a PAC notion of transfer learning and VC-like dimensions. We then introduce a family of learning rules applicable to transfer learning. After describing the theory, we will turn to proving a fundamental theorem that states these are all equivalent to PAC-learnability.

Definition 5 (PAC-transfer). A transfer learning setting $\mathcal{T} = (T, C, \mathcal{E})$ is PAC-transferable if:

$\exists$ algorithm $A$ $\forall \epsilon, \delta$ $\exists k_0, m_0, n_0$ (functions of $\epsilon, \delta$) $\forall k > k_0, m > m_0, n > n_0$ $\forall D$ :

$\mathbb{P}_{S \sim D[k,m,n]} \left[ \epsilon_{d_t}(A(S)) \le \inf_{c \in H} \epsilon_{d_t}(c) + \epsilon \right] \ge 1 - \delta$

where $A(S) \in H$ is the output of the algorithm, and $(k_0, m_0, n_0)$ are three functions of $(\epsilon, \delta)$.

This model relies on the original PAC-learning model. In the classical PAC model, a learning algorithm is introduced. The algorithm samples enough labeled data examples from an arbitrary distribution, labeled by a target concept. The output is a hypothesis that has a high probability of classifying correctly a new example (small error on the target task), with high confidence. In our framework, the idea is similar. In this case, the learner has access to examples from different tasks. The learner's hope is to be able to come up with useful common knowledge, from the source tasks, for learning a new concept for the target task. The output is a hypothesis that has a small error on the target task. In this case, the factory (chosen arbitrarily) provides the data samples. The main assumption is that the distributions from which the examples are selected are sampled i.i.d. from the same distribution of distributions. In many cases, we will provide a realizability assumption that all concepts share the same representation (i.e., lie in the same $K \in C$). As we already mentioned, this is the case when dealing with adversary factories. It provides common knowledge between the concepts. The probabilistic assumption enables the algorithm to transfer that useful information.

Next, we define VC-lie diensions for factory-based transfer learning. Unlie conventional VC diensions, which are purely cobinatorial, the suggested diensions are algorithic and probabilistic. This is because the post-transfer learning proble relies on inforation gained fro the source saples. Definition 6 (Transfer VC diension). T = (T, C, E) has transfer VC diension vc if: algorith N ɛ, δ 0, 0 (functions of ɛ, δ) > 0, > 0 D : P S D[, [vc(n(s )) vc and inf ɛ d t (c) inf ɛ d t (c) + ɛ 1 δ c N(S ) c H Here, N(S ) C is a hypothesis class. We say that the transfer VC diension is exactly d, if the above expression does not hold with d replaced with vc 1. The algorith N is called narrowing. These algoriths are special exaples of how the coon nowledge ight be extracted. In the first stage the algorith, that is provided with source data returns a narrow hypothesis class N(S ) that with a high probability approxiates very well on d t. N(S ) can be viewed as a learned representation of the tass fro D. Post-transfer, learning taes place in N(S ), where there exists a hypothesis that is ɛ-close to the best approxiation possible in H. In different situations, we will assue realizability, i.e, there exists B C such that D is supported by tass d that satisfy inf c B ɛ d (c) = 0. In the adversary case, each factory is a K-factory and, in particular, realizable. In the face recognition exaple, the deep learning algorith has access to face iages of ultiple huans. Fro this set of iages, a representation of an iage of huan faces is learned. Next, given the representation of huan faces, the learner selects a concept that best fits the target data in order to learn the specified huan face. By virtue of general VC theory, for learning a target tas, with enough target exaples (w.r.t. the capacity of N(S ) instead of H s capacity), one is able to output a hypothesis that is (ɛ + ɛ )-close to the best approxiation in H, where ɛ is the accuracy paraeter of the post-transfer learning algorith. 
We, therefore, define a 2-step program. The first step applies narrowing and replaces $H$ by a simplified hypothesis class $B = N(s_{[1,k]})$. The second step learns the target concept within $B$. An immediate special case, the T-ERM learning rule (transfer empirical risk minimization), uses an ERM rule as its second step. Put differently,

Algorithm 1: T-ERM learning rule
Input: $S = (s_{[1,k]}, s_t)$.
Output: concept $c_{out}$ such that $\epsilon_{d_t}(c_{out}) \le \epsilon$ with probability $\ge 1 - \delta$.
Narrowing: narrow the hypothesis class $H$ to $B := N(s_{[1,k]})$;
Output $c_{out} = ERM_B(s_t)$;

In the following sections, when needed, we denote the minimal target sample complexity of 2-step programs by $n_{2step}$ (a function of $\epsilon, \delta$). We claim that whenever T is a learnable binary classification learning setting, once the narrowing is performed, the ERM step is possible with a number of samples that depends only on $\epsilon, \delta$. For this purpose, the following Lemma is useful:

Lemma 1. The sample complexity of any learnable binary classification hypothesis class $H$ of VC dimension $\le u$ is bounded by a universal function $N_u(\epsilon, \delta)$, i.e., it depends only on the VC dimension.

Proof. Simply by Theorem 1.

Based on Lemma 1, with large enough $k, m$ (sufficient to apply the narrowing with error and confidence parameters $\epsilon/2, \delta/2$) and $n = N_u(\epsilon/2, \delta/2)$ target samples from Lemma 1, the T-ERM rule returns a concept that has error $\le \epsilon$ with probability $\ge 1 - \delta$.
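The two steps of the T-ERM rule can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration only: biases are finite lists of threshold concepts on the integers, and the narrowing `narrow` is a hypothetical choice that selects the bias whose per-task ERM errors are lowest on average. The paper does not prescribe this particular narrowing.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[int, int]            # (x, label) with label in {0, 1}
Hypothesis = Callable[[int], int]
Bias = List[Hypothesis]              # a bias B is a finite hypothesis class here


def threshold(t: int) -> Hypothesis:
    """Toy concept (assumption): x -> 1 if x >= t else 0."""
    return lambda x: int(x >= t)


def empirical_error(c: Hypothesis, s: Sequence[Example]) -> float:
    """Fraction of examples in s that c mislabels."""
    return sum(c(x) != y for x, y in s) / len(s)


def erm(bias: Bias, s: Sequence[Example]) -> Hypothesis:
    """ERM_B(s): the first hypothesis in B with minimal empirical error on s."""
    return min(bias, key=lambda c: empirical_error(c, s))


def narrow(family: List[Bias], sources: List[Sequence[Example]]) -> Bias:
    """Hypothetical narrowing N: the bias with the lowest average ERM error."""
    def joint_error(bias: Bias) -> float:
        return sum(empirical_error(erm(bias, s), s) for s in sources) / len(sources)
    return min(family, key=joint_error)


def t_erm(family: List[Bias], sources: List[Sequence[Example]],
          target: Sequence[Example]) -> Hypothesis:
    bias = narrow(family, sources)   # step 1: B := N(s_[1,k])
    return erm(bias, target)         # step 2: c_out = ERM_B(s_t)


# Toy data: source tasks use thresholds 8 and 7, the target task uses 9.
family = [[threshold(t) for t in (0, 1, 2)], [threshold(t) for t in (7, 8, 9)]]
sources = [[(5, 0), (8, 1), (9, 1)], [(6, 0), (7, 1), (3, 0)]]
target = [(8, 0), (9, 1)]
c_out = t_erm(family, sources, target)
```

On this toy data the narrowing keeps the second bias, so the target step only has to choose among three thresholds using two target examples.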

4 Results in the adversary model

In this section, we make use of the adversary factories in order to present the equivalence between the different definitions of transferability discussed above for the binary case. Furthermore, it is shown that in this case, transferability is equivalent to PAC-learnability. In the next section, we study the advantages of transfer learning that exist despite this equivalence.

4.1 Transferability vs. learnability

The binary classification case

Theorem 3. Let T = (T, C, E) be a binary classification transfer learning setting. The following conditions on T are then equivalent: (1) it has a finite transfer VC dimension; (2) it is PAC-transferable; (3) it is PAC-learnable.

Next, we provide bounds on the target sample complexity of 2-step programs.

Corollary 1 (Quantitative results). When T = (T, C, E) is a binary classification transfer learning setting that has transfer VC dimension $vc$, then the following holds:
$$C_1 \left( \frac{vc + \log(1/\delta)}{\epsilon} \right) \le n_{2step}(\epsilon, \delta) \le C_2 \left( \frac{vc + \log(1/\delta)}{\epsilon^2} \right)$$
for some constants $C_1, C_2 > 0$.

Proof. This corollary follows immediately from the characterization above and is based on Blumer et al. (1989). We apply narrowing in order to narrow the class to VC dimension $vc$. The second step learns the hypothesis in the narrow subclass. The upper bound follows when the narrow subclass differs from $K$ (unrealizable case), while the lower bound arises when $K$ equals the narrow subclass (realizable case).

The regression case

We demonstrate that transferability does not imply PAC-learnability in regression problems. PAC-learning and PAC-transferability are well defined for regression using appropriate losses. As the following Lemma shows, in the regression case, there is no simple equivalence between PAC-transferability and learnability.

Lemma 2. There is a transfer learning setting T = (T, C, E) that is PAC-transferable but not PAC-learnable with squared loss $\ell$.

While the example in the proof of Lemma 2 (Appendix) is seemingly pathological, the scenario of non-learnability in regression is common, for example, due to collinearity that gives rise to ill-conditioned learning problems. Having the ability to learn from source tasks reduces the ambiguity.

4.2 Trivial and non-trivial transfer learning

Transfer learning is beneficial if it reduces the required target sample complexity. We call this the non-trivial transfer property. It can also be said that a transfer learning setting T = (T, C, E) is non-trivial transferable if there is a transfer learning algorithm for it with a target sample complexity smaller than the sample complexity of any learning algorithm of $H$ by a factor $0 < c < 1$. An alternative definition (that is not equivalent) is that the transfer VC dimension and the regular VC dimension differ. We next describe a pathological case in which transfer is trivial, and then demonstrate the existence of non-trivial transfer.

The pathological case can be demonstrated in the following simple example. Let $H$ be the set of all 2D disks in $\mathbb{R}^3$ around 0. Each $E_h$ contains the disks on the same hyperplane $h$, for a finite collection of $h$. Consider the factory $D$ that samples distributions $d$ supported only by points from the hyperplanes $h$ with a distance of at least 1 from the origin. Since, in our model, the concepts are selected arbitrarily, consider the case where all source concepts are disks with a radius smaller than 1, and the target concept has a radius of 2. In the source data, all examples are negative, and no information is gained on the hyperplane $h$.

Despite the existence of the pathological case above, the following Lemma claims the existence of non-trivial transferability.

Lemma 3. There exists a binary classification transfer learning setting T = (T, C, E) (i.e., T is a binary classification setting) that is non-trivial transferable.

4.3 Generalization bounds for adversary transfer learning

In the previous section, we investigated the relationship between learnability and transferability. It was demonstrated that, in some cases, there is non-trivial transferability. In such cases, transfer learning is beneficial and helps to reduce the size of the target data. In this section, we extend the discussion on non-trivial transferability. We focus on generalization bounds for transfer learning in the adversary model. Two bounds are presented. The first bound is a VC-style bound. The second bound combines both PAC-Bayesian and VC perspectives. The proposed bounds shed some light on representation learning and transfer learning in general. Nevertheless, despite the wide applicability of these generalization bounds, it is not trivial to derive a transfer learning algorithm from them, since they measure the difference between generalization and empirical compatibility error rates. In general, computing the empirical compatibility error rate requires knowledge about the subject of the factory, which is kept unknown. Therefore, without additional assumptions it is intractable to compute this quantity.

VC-style bounds for adversary transfer learning

We extend the original VC generalization bound for the case of adversary transfer learning.
We call the presented bound the min-max transfer learning bound. In this context, min-max stands for the competition between the difficulty of approximating $K$ and the ability of a bias $B$ to approximate it. This bound estimates the expected worst-case difference between the generalization compatibility of $B$ to $K$ and the empirical source compatibility of $B$ and $K$. The upper bound is the sum of two regularization terms. The first penalizes both complexities of $B$ and $K$ with respect to the number of samples per task, $m$. The second penalizes the complexity of $C$ with respect to $k$.

The first step towards constructing a VC bound in the adversary model is defining a growth function specialized for this setting. The motivation is controlling the compatibility between a bias $B$ and the subject $K$. Throughout the construction of compatibility measurements, the most elementary unit is the empirical error of $B$, $\min_{c_1 \in B} \epsilon_o(c_1, c_2)$, for some $c_2 \in K$ together with an unlabeled data set $o$. Instead of dealing with the whole bias $B$, we can focus only on $c_1 = ERM_B(c_2(o))$. In that way we can control the compatibility of $B$ with $K$ on the data set $o$. In transfer learning, we wish to control the joint error, that is, the average of multiple compatibility errors on different data sets. For this purpose, we count the number of different configurations of two concepts $c_{i,1}$ and $c_{i,2}$ on unlabeled data sets $o_i$ such that $c_{i,1} = ERM_B(c_{i,2}(o_i))$. To avoid notational overload, we assume that the ERM is fixed, i.e., we assume an inner implementation of an ERM rule that takes a data set and returns a hypothesis for any selected bias $B$. Nevertheless, we do not restrict how it is implemented. More formally, $ERM_B(s)$ represents a specific function that takes $B, s$ and returns a hypothesis in $B$.
Based on this background, we denote the following set of configurations:
$$[H, C, K]_O = \big\{ (c_{1,1}(o_1), c_{1,2}(o_1), \ldots, c_{k,1}(o_k), c_{k,2}(o_k)) : c_{i,2} \in K \text{ and } c_{i,1} = ERM_B(c_{i,2}(o_i)) \text{ s.t. } B \in C \big\}$$
In addition, the Adversarial Transfer Growth Function $\tau(k, m; C, K)$ is defined as
$$\tau(k, m; C, K) = \max_{O \in X^{mk}} \big| [H, C, K]_O \big|$$
This quantity represents the worst-case number of possible configurations.
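For finite toy classes, the configuration set and the growth function above can be enumerated by brute force. The following Python sketch does so under illustrative assumptions that are not part of the paper: concepts are integer thresholds, biases and the subject $K$ are finite lists, and the fixed ERM breaks ties by taking the first minimizer in list order.

```python
from itertools import product


def threshold(t):
    """Toy concept (assumption): x -> 1 if x >= t else 0."""
    return lambda x: int(x >= t)


def labels(c, o):
    """c(o): the labeling that concept c induces on the unlabeled data set o."""
    return tuple(c(x) for x in o)


def erm(bias, o, y):
    """Fixed ERM_B: first hypothesis in B with the fewest disagreements with y on o."""
    return min(bias, key=lambda c: sum(a != b for a, b in zip(labels(c, o), y)))


def configurations(family, subject, datasets):
    """The set [H, C, K]_O for unlabeled data sets O = (o_1, ..., o_k)."""
    configs = set()
    for bias in family:                                          # B ranges over C
        for targets in product(subject, repeat=len(datasets)):  # each c_{i,2} in K
            row = []
            for c2, o in zip(targets, datasets):
                y = labels(c2, o)
                c1 = erm(bias, o, y)            # c_{i,1} = ERM_B(c_{i,2}(o_i))
                row += [labels(c1, o), y]
            configs.add(tuple(row))
    return configs


def growth(family, subject, domain, k, m):
    """tau(k, m; C, K): maximum of |[H, C, K]_O| over all O with k sets of size m."""
    all_datasets = product(product(domain, repeat=m), repeat=k)
    return max(len(configurations(family, subject, O)) for O in all_datasets)


family = [[threshold(0), threshold(1)], [threshold(2)]]   # C, two biases
subject = [threshold(1), threshold(2)]                    # K
O = [(0, 1), (1, 2)]                                      # k = 2 data sets, m = 2
n_configs = len(configurations(family, subject, O))
tau = growth(family, subject, domain=(0, 1, 2), k=2, m=2)
```

In this toy instance both `n_configs` and `tau` equal 8, which is also the trivial ceiling $|C| \cdot |K|^k = 2 \cdot 2^2$, since each choice of a bias and of $k$ subject concepts yields exactly one configuration.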

Theorem 4 (The min-max transfer learning bound). Let T = (T, C, E) be a binary classification transfer learning setting. Then,
$$\forall D\ \forall K \in C:\quad \mathbb{E}_{O \sim D[k,m]}\Big[ \sup_{B \in C} \big( E(B, K) - E_O(B, K) \big) \Big] \le \frac{4 + \sqrt{\log(\sup_B \tau_B(2m)) + \log(\tau_K(2m))}}{\sqrt{2m}} + \frac{4 + \sqrt{\log(\tau(2k, m; C, K))}}{\sqrt{2k}}$$

PAC-Bayes bounds for adversary transfer learning

This bound combines PAC-Bayesian and VC perspectives. We call it the perturbed min-max transfer learning bound. This is because there is still a competition between the ability of the bias to approximate and the difficulty of the subject. Nevertheless, in this case the bias is perturbed. We take a statistical relaxation of the standard model. A set of posterior distributions $\mathcal{Q}$ and a prior distribution $P$, both over $C$, are taken. Extending the discussion in Section 2, the aim is to select the $Q \in \mathcal{Q}$ that best fits the data, instead of a concrete bias. In this setting, we measure an expected version of the generalization compatibility error rate with $B$ distributed by $Q \in \mathcal{Q}$. We call it the Expected Generalization Compatibility Error Rate. Formally,
$$E(Q, K) = \mathbb{E}_{B \sim Q}\, \mathbb{E}_{p \sim D} \big[ err_p(B, K) \big]$$
It is important to note that the left hand side of the bound is, in general, intractable to compute. This is due to its direct dependence on $K$, which is unknown. Nevertheless, there still might be conditions in which different learning methods do minimize this quantity. In addition, it gives insights on what a good bias is.

Theorem 5 (The perturbed min-max transfer learning bound). Let T = (T, C, E) be a binary classification transfer learning setting. In addition, let $P$ be a prior distribution and $\mathcal{Q}$ a family of posterior distributions, both over $C$. Let $\delta \in (0, 1)$ and $\lambda > 0$. Then for all factories $D$, with probability $\ge 1 - \delta$ over the selection of $O \sim D[k, m]$, $\forall Q \in \mathcal{Q}, \forall K \in C$:
$$E(Q, K) \le \frac{1}{k} \sum_{i=1}^{k} \mathbb{E}_{B \sim Q}\big[ err_{o_i}(B, K) \big] + \sqrt{\frac{2 \log(\tau_H(2m)) + \log(8/\lambda\delta)}{m}} + \frac{1}{m} + \sqrt{\frac{KL(Q \| P) + \log(2k/\delta)}{2(k-1)}} + \lambda\delta$$
with the restriction that $k \ge \frac{8 \log(2/\delta)}{(\lambda\delta)^2}$.
5 Results in the randomized model

We start our discussion of randomized factories with the following Lemma, which revisits the example of disks in $\mathbb{R}^3$ above for this case.

Lemma 4. Let T = (T, C, E) be a realizable transfer learning setting such that $H$ is the set of all 2D disks in $\mathbb{R}^3$ around 0. Each $E_h$ is the set of disks on the same hyperplane $h$, and $C = \{E_h : h \text{ hyperplane in } \mathbb{R}^3 \text{ around } 0\}$. This hypothesis class has transfer VC dimension 1 (and regular VC dimension 2).

5.1 Transferability vs. learnability

A transferring rule or bias learner $N$ is a function that maps source data (i.e., $s_{[1,k]}$) into a bias $B$. An interesting special case is the simplifier. A simplifier fits $B \in C$ that has a relatively small error rate.

Definition 7 (Simplifier). Let T = (T, C, E) be a transfer learning setting. An algorithm $N$ with access to $C$ is called a simplifier if:
$$\forall \epsilon, \delta\ \ \exists k_0, m_0 \text{ (functions of } \epsilon, \delta)\ \ \forall k > k_0,\ m > m_0\ \ \forall D:\quad P_S\Big[ \epsilon_D(N(S)) \le \inf_{B \in C} \epsilon_D(B) + \epsilon \Big] \ge 1 - \delta$$

Here, the source data $S$ is sampled according to $D[k, m]$. In addition, $N(S) \in C$, which is a hypothesis class, is the result of applying the algorithm to the source data. The quantities $k_0, m_0$ are functions of $\epsilon, \delta$.

The standard ERM rule is next extended into a transferring rule. This rule returns a bias $B$ that has the minimum error rate on the data, measured for each data set separately. This transferring rule, called $C\text{-}ERM_C(S)$, is defined as follows:
$$C\text{-}ERM_C(S) := \arg\min_{B \in C} \frac{1}{k} \sum_{i=1}^{k} \epsilon_{s_i}(c_{i,B}), \quad \text{s.t. } c_{i,B} = ERM_B(s_i)$$
This transferring rule was previously considered by Ando et al. (2005), who named it Joint ERM.

Uniform convergence (Shalev-Shwartz et al. (2010)) is defined for every hypothesis class $H$ (w.r.t. loss $\ell$) in the usual manner:
$$\forall d:\quad P_{s \sim d^m}\big[ \forall c \in H : |\epsilon_d(c) - \epsilon_s(c)| \le \epsilon \big] \ge 1 - \delta$$
It can also be defined for $C$ (w.r.t. loss $g$) as:
$$\forall D:\quad P_{U \sim D[k]}\big[ \forall B \in C : |\epsilon_D(B) - \epsilon_U(B)| \le \epsilon \big] \ge 1 - \delta$$
for any $k$ larger than some function of $(\epsilon, \delta)$.

The following lemma states that whenever both $H$ and $C$ have uniform convergence properties, the $C\text{-}ERM_C$ transferring rule is a simplifier for $H$.

Lemma 5. Let T = (T, C, E) be a transfer learning setting. If both $C$ and $H$ have uniform convergence properties, then the $C\text{-}ERM_C$ rule is a simplifier of $H$.

The preceding Lemma explained that $C\text{-}ERM_C$ transferring rules are helpful for transferring knowledge efficiently. The next Theorem states that even if the hypothesis class $H$ has an infinite VC dimension, there still might be a simplifier that outputs hypothesis classes with a finite VC dimension. This result, however, cannot be obtained when restricting the size of the data to be bounded by some function of $\epsilon, \delta$. Therefore, in a sense, there is transferability beyond learnability.

Theorem 6. The following statements hold for binary classification.

There is a binary classification transfer learning setting T = (T, C, E) such that $H$ has an infinite VC dimension and that has a simplifier $N$ that always outputs a finite VC dimensional bias $B$.
If a binary classification transfer learning setting T = (T, C, E) has an infinite VC dimension, then $\sup_{B \in C} vc(B) = \infty$.

5.2 Generalization bounds for randomized transfer learning

In the previous section, we explained that whenever both $H$ and $C$ have uniform convergence properties, there exists a simplifier for this transfer learning setting. We explained that, in this case, a C-ERM rule is an appropriate simplifier. In this section, we widen the discussion on the existence of a simplifier for different cases. For this purpose, we extend well-known generalization bounds from statistical learning theory to the case of transfer learning.

VC-style bounds for transfer learning

We begin with an extension of the original VC bound (see Section 2) to the case of transfer learning. It upper bounds the expected (w.r.t. random source data) difference between the transfer generalization risk of $B$ and the 2-step Source Empirical Risk working on $B$. A mapping $r : B \mapsto r_B$ from a bias $B$ to a learning rule of the bias (i.e., one that outputs hypotheses in $B$ with empirical error that converges to $\inf_{c \in B} \epsilon_d(c)$) is called a post transfer learning rule/algorithm (see Equation 2). Informally, the 2-step source empirical risk measures the empirical success rate of a post transfer learning rule on a few data sets. The bounding quantity depends on the ability of the post transfer learning rule to generalize. The 2-step source empirical risk is formally defined as follows:
$$\epsilon_S(B, r) := \frac{1}{k} \sum_{i=1}^{k} \epsilon_{s_i}(r_B(s_i)), \quad \text{s.t. } S = s_{[1,k]}$$
Next, the standard constructions of the original VC bound are extended:
$$[H, C, r]_S = \big\{ (r_{1,B}(s_1), \ldots, r_{k,B}(s_k)) : B \in C \big\}, \quad \text{s.t. } S = s_{[1,k]}$$
Here, $r_{i,B} := r_B(s_i)$ denotes the application of the learning rule $r_B$ on $s_i$, and $r_{i,B}(s_i)$ denotes the realization of $r_{i,B}$ on $s_i$. The equivalent of the standard growth function in transfer learning is the transfer growth function,
$$\tau(k, m, r) = \max_{S} \big| [H, C, r]_S \big|$$
With this formalism, we can state our extended version of the VC bound.

Theorem 7 (Transfer learning bound 1). Let T = (T, C, E) be a binary classification transfer learning setting such that T is learnable. In addition, assume that $r$ is a post transfer learning rule, i.e., endowed with the following property:
$$\forall d, B:\quad r_B(\cdot) \in B \ \text{ and } \ \mathbb{E}_{s \sim d^m}\Big[ \big| \inf_{c \in B} \epsilon_d(c) - \epsilon_s(r_B(s)) \big| \Big] \le \epsilon(m) \to 0 \qquad (2)$$
Then,
$$\mathbb{E}_{S \sim D[k,m]}\Big[ \sup_{B \in C} \big( \epsilon_D(B) - \epsilon_S(B, r) \big) \Big] \le \frac{4 + \sqrt{\log(\tau(2k, m, r))}}{\sqrt{2k}} + \epsilon(m)$$
We conclude that in binary classification, if T is learnable and $r$ satisfies Equation 2, then there exists a simplifier whenever the transfer growth function $\tau(k, m, r)$ is polynomial in $k$.

PAC-Bayes bounds for transfer learning

We provide two different PAC-Bayes bounds for transfer learning. The first bound estimates the gap between the generalization transfer risk of each $B \in C$ and the average of the empirical risks of $c_{i,B}$ in the binary classification case. Theorem 9, on the other hand, addresses a more general case in which $H$ might have an infinite VC dimension or the underlying learning setting is not binary classification. The first approach concentrates on model selection within PAC-Bayesian bounds. It presents a bound for model selection that combines PAC-Bayes and VC bounds.
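The 2-step source empirical risk $\epsilon_S(B, r)$ and the Joint ERM (C-ERM) rule from Section 5.1 share the same computation: apply a post transfer learning rule to each source data set and average the resulting empirical errors. The Python sketch below illustrates this for finite biases. The threshold concepts, the `erm_rule` tie-breaking, and the toy data are assumptions chosen for illustration, not part of the paper.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[int, int]                                # (x, label)
Hypothesis = Callable[[int], int]
Bias = List[Hypothesis]
Rule = Callable[[Bias, Sequence[Example]], Hypothesis]   # r_B(s) for a given B


def threshold(t: int) -> Hypothesis:
    """Toy concept (assumption): x -> 1 if x >= t else 0."""
    return lambda x: int(x >= t)


def empirical_error(c: Hypothesis, s: Sequence[Example]) -> float:
    """eps_s(c): fraction of examples in s that c mislabels."""
    return sum(c(x) != y for x, y in s) / len(s)


def erm_rule(bias: Bias, s: Sequence[Example]) -> Hypothesis:
    """One possible post transfer rule r_B: ERM within B, first minimizer wins."""
    return min(bias, key=lambda c: empirical_error(c, s))


def two_step_source_risk(bias: Bias, sources: List[Sequence[Example]],
                         rule: Rule = erm_rule) -> float:
    """eps_S(B, r) = (1/k) * sum_i eps_{s_i}(r_B(s_i))."""
    return sum(empirical_error(rule(bias, s), s) for s in sources) / len(sources)


def c_erm(family: List[Bias], sources: List[Sequence[Example]]) -> Bias:
    """C-ERM_C(S): the bias in C minimizing the 2-step source risk with r = ERM."""
    return min(family, key=lambda b: two_step_source_risk(b, sources, erm_rule))


bias_a = [threshold(1), threshold(2)]
bias_b = [threshold(5)]
sources = [
    [(0, 0), (1, 1), (3, 1), (6, 1)],   # labeled by threshold 1
    [(1, 0), (2, 1), (4, 1), (5, 1)],   # labeled by threshold 2
]
risk_a = two_step_source_risk(bias_a, sources)
risk_b = two_step_source_risk(bias_b, sources)
chosen = c_erm([bias_a, bias_b], sources)
```

Here `bias_a` contains a perfect hypothesis for each source task, so its 2-step risk is 0, whereas `bias_b` mislabels half of each data set; the C-ERM rule therefore returns `bias_a`.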
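We construct a generalization bound to measure the fit of a random representation, i.e., the motivation is searching for the $Q$ that minimizes
$$R(Q) = \mathbb{E}_{B \sim Q}\, \mathbb{E}_{d \sim D} \Big[ \inf_{c \in B} \epsilon_d(c) \Big]$$

Theorem 8 (Transfer learning bound 2). Let T = (T, C, E) be a binary classification transfer learning setting. In addition, let $P$ be a prior distribution and $\mathcal{Q}$ a family of posterior distributions, both over $C$. Let $\delta \in (0, 1)$ and $\lambda > 0$. Then with probability $\ge 1 - \delta$ over $S$, $\forall Q \in \mathcal{Q}$:
$$R(Q) \le \frac{1}{k} \sum_{i=1}^{k} \mathbb{E}_{B \sim Q}\big[ \epsilon_{s_i}(c_{i,B}) \big] + \sqrt{\frac{\log(\tau_H(2m)) + \log(8/\lambda\delta)}{m}} + \frac{1}{m} + \sqrt{\frac{KL(Q \| P) + \log(2k/\delta)}{2(k-1)}} + \lambda\delta$$
with the restriction that $k \ge \frac{8 \log(2/\delta)}{(\lambda\delta)^2}$.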
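Two data-dependent ingredients of this bound are directly computable when $C$ is finite: the averaged empirical term $\frac{1}{k}\sum_i \mathbb{E}_{B \sim Q}[\epsilon_{s_i}(c_{i,B})]$ and the divergence $KL(Q \| P)$. The Python sketch below computes both, assuming (hypothetically) that $Q$ and $P$ are given as probability vectors over the biases and that the per-task empirical errors `risks[b][i]` $= \epsilon_{s_i}(c_{i,B_b})$ have already been obtained, for example by running ERM within each bias.

```python
import math
from typing import List, Sequence


def kl_divergence(q: Sequence[float], p: Sequence[float]) -> float:
    """KL(Q || P) for discrete distributions over the same finite family C."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def gibbs_source_risk(q: Sequence[float], risks: List[List[float]]) -> float:
    """(1/k) * sum_i E_{B~Q}[eps_{s_i}(c_{i,B})], with risks[b][i] = eps_{s_i}(c_{i,B_b})."""
    k = len(risks[0])
    return sum(qb * sum(r) / k for qb, r in zip(q, risks))


prior = [0.5, 0.5]                    # P: uniform over two biases
risks = [[0.0, 0.0], [0.5, 0.5]]      # per-bias, per-task empirical errors (toy values)

q_uniform = [0.5, 0.5]
q_peaked = [1.0, 0.0]                 # all posterior mass on the first bias

emp_uniform = gibbs_source_risk(q_uniform, risks)
emp_peaked = gibbs_source_risk(q_peaked, risks)
kl_uniform = kl_divergence(q_uniform, prior)
kl_peaked = kl_divergence(q_peaked, prior)
```

The toy values show the trade-off that such a bound expresses: concentrating $Q$ on the better bias lowers the empirical term from 0.25 to 0 but raises $KL(Q \| P)$ from 0 to $\log 2$.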