A Theoretical Framework for Deep Transfer Learning


Tomer Galanti, The School of Computer Science, Tel Aviv University
Lior Wolf, The School of Computer Science, Tel Aviv University
Tamir Hazan, Faculty of Industrial Engineering & Management, Technion

Abstract

We generalize the notion of PAC learning to include transfer learning. In our framework, the linkage between the source and the target tasks is a result of having the sample distribution of all classes drawn from the same distribution of distributions, and by restricting all source and target concepts to belong to the same hypothesis subclass. We have two models: an adversary model and a randomized model. In the adversary model, we show that for binary classification, conventional PAC-learning is equivalent to the new notion of PAC-transfer and to a transfer generalization of the VC dimension. For regression, we show that PAC-transferability may exist even in the absence of PAC-learning. In the randomized model, we provide PAC-Bayesian and VC-style generalization bounds for transfer learning, including bounds specifically derived for Deep Learning. A wide discussion of the tradeoffs between the different parameters involved in the bounds is provided. We demonstrate both cases in which transfer does not reduce the sample size ("trivial transfer") and cases in which the sample size is reduced ("non-trivial transfer").

1 Introduction

The advent of deep learning has helped promote the everyday use of transfer learning in a variety of learning problems. Representations, which are nothing more than activations of the network units at the deep layers, are used as general descriptors even though the network parameters were obtained while training a classifier on a specific set of classes under a specific sample distribution. As a result of the growing popularity of transferring deep learning representations, the need for a suitable theoretical framework has increased.

In the transfer learning setting that we consider, there are source tasks along with a target task. The source tasks are used to aid in the learning of the target task; however, the loss on the source tasks is not part of the learner's goal. As an illustrative example, consider the use of deep learning for the task of face recognition. There are 7 billion classes, each corresponding to a person, and each has its own indicator function (classifier). Moreover, the distribution of the images of each class is different. Some individuals are photographed more casually, while others are photographed at formal events. Some are photographed mainly under bright illumination, while the images of others are taken indoors. Hence, a complete discussion of transfer learning has to take into account both the classifiers and the distributions of the class samples. A deep face-recognition neural network is trained on a small subset of the classes. For example, the DeepFace network of Taigman et al. (2014) is trained using images of only 4030 persons. The activations of the network, at the layer just below the classification layer, are then used as a generic tool to represent any face, regardless of the image distribution of that person's album images.
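A minimal sketch of this reuse pattern, in plain numpy, is given below. It is not the DeepFace pipeline: the "pretrained" shared weights, the layer sizes, and the least-squares classifier on top are illustrative assumptions only.

```python
# A minimal numpy sketch (not the DeepFace pipeline) of the transfer pattern
# described above: a network is trained on a small subset of classes, and the
# activations of its penultimate layer are reused as generic descriptors for a
# new class. All shapes, names, and the random "pretrained" weights below are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def penultimate_features(x, weights):
    """Run the shared (pretrained) part of the network: a stack of affine maps
    followed by a nonlinearity; the last activation is the representation."""
    h = x
    for W, b in weights:
        h = np.maximum(0.0, h @ W + b)   # ReLU activations of each layer
    return h                              # activations just below the classifier

# "Pretrained" shared weights, standing in for a network trained on source classes.
shared = [(rng.normal(size=(256, 128)), np.zeros(128)),
          (rng.normal(size=(128, 64)),  np.zeros(64))]

# A small dataset for a *new* person (the target task): n images, binary labels.
x_target = rng.normal(size=(40, 256))
y_target = rng.integers(0, 2, size=40)

# Learn only a linear classifier on top of the frozen representation
# (least-squares stand-in for the target-task learner).
feats = penultimate_features(x_target, shared)
w, *_ = np.linalg.lstsq(feats, 2.0 * y_target - 1.0, rcond=None)
predictions = (feats @ w > 0).astype(int)
print("target-task training error:", np.mean(predictions != y_target))
```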

In this paper, we study a transferability framework, which is constructed to closely match the theory of the learnable and its extensions, including PAC learning (Valiant, 1984) and the VC dimension. A fundamental Theorem of transfer learning, which links these concepts in the context of transfer learning, is provided. We introduce the notion of a simplifier, which has the ability to return a subclass that is a good approximation of the original hypothesis class and is easier to learn. The conditions for the existence of a simplifier are discussed, and we show cases of transferability despite infinite VC dimensions. PAC-Bayesian and VC bounds are derived, in particular for the case of Deep Learning. A few illustrative examples demonstrate the mechanisms of transferability.

A cornerstone of our framework is the concept of a factory. Its role is to tie together the distributions of the source tasks and the target task without explicitly requiring the underlying distributions to be correlated or otherwise closely linked. The factory simply assumes that the distribution of the target task and the distributions of the source tasks are drawn i.i.d from the same distribution of distributions. In the face recognition example above, the subset of individuals used to train the network is a random subset of the population from which the target class (another individual) is also taken. The factory provides a subset of the population and a dataset corresponding to each person. The goal of the learner is to learn efficiently how to recognize a new person's face using a relatively small dataset of the new person's face images. This idea generalizes the classic notion of learning, in which the learner has access to a finite sample of examples and its goal is to classify wisely a new unseen example.

Table 1: Summary of notations

ɛ, δ : error rate and confidence parameters in (0, 1)
X : the instances set
Y : the labels set
Z : the examples set; usually X × Y
p : a distribution
d : a task (a distribution over Z)
k : the number of source tasks
m : the number of samples for each source task
U : a finite set of distributions; usually U = {d_1, ..., d_k} or U = {p_1, ..., p_k}
E : a set of distributions over X
E : an environment, a set of tasks
prob_p(X) or p(X) : the probability of a set X under the distribution p
P, E : the probability and expectation operators
P[X | Y], E[X | Y] : the conditional probability and expectation
D[K] or just D : a distribution over distributions (see Definitions 3, 4)
K : the subject of a factory
s = {z_1, ..., z_m} : a data set of m examples, with z_i ∈ Z for every i
S = (s_[1,k], s_t) : k source data sets s_1, ..., s_k (of the same size) and one target data set s_t
o = {x_1, ..., x_m} : a data set of m instances, with x_i ∈ X for every i
O = (o_[1,k], o_t) : k unlabeled source data sets o_1, ..., o_k (of the same size) and one unlabeled target data set o_t
S ∼ D[k, m, n] : a data set S sampled according to the factory D with sizes |s_i| = m for all i ∈ [k] and |s_t| = n
S ∼ D[k, m] : a source data set S sampled according to the factory D with sizes |s_i| = m for all i ∈ [k]
U ∼ D[k] : a set of k tasks taken from D
d ∼ D : a task taken from D
H : a hypothesis class (in the supervised case, a set of functions X → Y)
c : a concept; an item of H
C : a hypothesis class family; a set of subsets of H such that H = ∪_{B∈C} B
B : a bias, i.e., B ∈ C (and B ⊂ H)
N : an algorithm that outputs hypothesis classes
A : an algorithm that outputs concepts
r(s) : the application of an algorithm r on data s
l : H × Z → R, a loss function
0-1 loss : l(c, (x, y)) = 1[c(x) ≠ y]
squared loss : l(c, (x, y)) = (c(x) - y)²/2
T : a learning setting; usually T = (H, Z, l)
T_PB : a PAC-Bayes setting; usually T_PB = (T, Q, p)
T : a transfer learning setting; usually T = (T, C, E)
ɛ_d(c) : the generalization risk function, the expectation of l(c, z), i.e., E_{z∼d}[l(c, z)]
ɛ_s(c) : the empirical risk function; ɛ_s(c) = (1/|s|) Σ_{z∈s} l(c, z)
g : C × E → R, the infimum risk; g(B, d) = inf_{c∈B} ɛ_d(c) = inf{ɛ_d(c) : c ∈ B}
ɛ_D(B) : the transfer generalization risk, E_{d∼D}[g(B, d)]
ɛ_U(B) : the source generalization risk, (1/|U|) Σ_{d∈U} g(B, d)
ɛ_s(B, r) : the 2-step empirical risk, ɛ_s(r_B(s))
ɛ_S(B, r) : the 2-step source empirical risk, (1/k) Σ_{i=1}^k ɛ_{s_i}(r_B(s_i))
R(q) : the randomized transfer risk, E_{B∼q}[ɛ_D(B)]
R_U(q) : the randomized source generalization risk, E_{B∼q}[ɛ_U(B)]
KL(q‖p) : the KL-divergence, KL(q‖p) = E_{x∼q}[log(q(x)/p(x))]
ɛ_p(c_1, c_2) : the mutual error rate; ɛ_p(c_1, c_2) = ɛ_{(p,c_1)}(c_2)
ɛ_o(c_1, c_2) : the mutual empirical error rate; ɛ_o(c_1, c_2) = ɛ_{c_1(o)}(c_2)
err_p(B, K) : the compatibility error rate; err_p(B, K) = sup_{c_1∈K} inf_{c_2∈B} ɛ_p(c_1, c_2)
err_o(B, K) : the empirical compatibility error rate; err_o(B, K) = sup_{c_1∈K} inf_{c_2∈B} ɛ_o(c_1, c_2)

Table 2: Summary of notations (continued)

E_U(B, K) : the source compatibility error rate; E_U(B, K) = (1/|U|) Σ_{p∈U} err_p(B, K)
E(B, K) : the generalization compatibility error rate; E(B, K) = E_{p∼D}[err_p(B, K)]
E_O(B, K) : the source empirical compatibility error rate; E_O(B, K) = (1/|O|) Σ_{o∈O} err_o(B, K)
h_{V,E,σ,w} : a neural network with architecture (V, E, σ) and weights w : E → R
H_{V,E,σ} : the set of all neural networks with architecture (V, E, σ)
H^I_{V,E,σ} : the family of all subsets of H_{V,E,σ} determined by fixing the weights on I ⊂ E
E = I ∪ J : a partition of the edges of a neural network; I is the set of edges in the transfer architecture and J the rest of the edges (i.e., I ∩ J = ∅)
H_{V,E,j,σ} : the architecture induced by (V, E, σ) when taking only the first j layers (see Section 6)
ERM_B(s) : the empirical risk minimizer; ERM_B(s) = argmin_{c∈B} ɛ_s(c)
C-ERM_C(s_[1,k]) : the class empirical risk minimizer; C-ERM_C(s_[1,k]) = argmin_{B∈C} (1/k) Σ_{i=1}^k min_{c∈B} ɛ_{s_i}(c)
c_{i,B} : the empirical risk minimizer in B for the i-th data set; c_{i,B} = ERM_B(s_i)
r_{i,B} : the application of a learner r_B of B on s_i; r_{i,B} = r_B(s_i)
u ⊕ v : the concatenation of the vectors u, v
0_s : a zeros vector of length s
1 : a unit matrix
N_u(ɛ, δ) : a universal bound on the sample complexity of learning any hypothesis class of VC dimension u
E_h : the set of all disks around 0 that lie on the hyperplane h
vc(H) : the VC dimension of the hypothesis class H
τ_H(m) : the growth function of the hypothesis class H, i.e., τ_H(m) = max_{{x_1,...,x_m}⊂X} |{(c(x_1), ..., c(x_m)) : c ∈ H}|
τ(k, m, r) : the transfer growth function of the hypothesis class H, i.e., τ(k, m, r) = max_{{s_1,...,s_k}⊂Z^m} |{(r_{1,B}(s_1), ..., r_{k,B}(s_k)) : B ∈ C}|
τ(k, m; C, K) : the adversary transfer growth function, i.e., τ(k, m; C, K) = max_{{o_1,...,o_k}⊂X^m} |{(c_{1,1}(o_1), c_{1,2}(o_1), ..., c_{k,1}(o_k), c_{k,2}(o_k)) : c_{i,1} ∈ K and c_{i,2} = ERM_B(c_{i,1}(o_i)) s.t. B ∈ C}|

2 Background

In this part, a brief introduction to the required background is provided. The general learning framework, the PAC-Bayesian setting and deep learning are introduced. These subjects are used and extended in this work. A reader who is familiar with these concepts may skip to the next sections.

The general learning setting. Recall the general learning setting proposed by Vapnik (1995). This setting generalizes classification, regression, multiclass classification, and several other learning settings.

Definition 1. A learning setting T = (H, Z, l) is specified by: a hypothesis class H; an examples set Z (with a sigma-algebra); and a loss function l : H × Z → R.

This approach helps to define supervised learning settings such as binary classification and regression in a formal and very clean way. Furthermore, in this framework, one can define learning scenarios in which the concepts are not functions of examples, but still have relations with examples from Z measured by loss functions (e.g., clustering, density estimation, etc.). If nothing else is mentioned, T stands for a learning setting. We say that T is learnable if the corresponding H is learnable. In addition, if H has VC dimension d, we say that T also has VC dimension d. With these notions, we present an extended transfer learning setting as a special case of the general learning setting with a few changes.

If a distribution d over Z is specified, the fit of each c ∈ H is measured by the Generalization Risk,
ɛ_d(c) = E_{z∼d}[l(c, z)]
Here, H, Z and l are known to the learner. The distribution d is called a task and is kept unknown. The goal of the learner is to pick c ∈ H that is closest to inf_{c∈H} ɛ_d(c). Since the distribution is unknown, this cannot be computed directly and can only be approximated using an empirical data set s = {z_1, ..., z_m} selected i.i.d according to d. In many machine learning algorithms, the empirical risk function,
ɛ_s(c) = (1/m) Σ_{z∈s} l(c, z)
has great impact on the selection of the output hypothesis.

Binary classification: Z = X × {0, 1} and H consists of functions c : X → {0, 1}, with l the 0-1 loss.
Regression: Z = X × Y, where X and Y are bounded subsets of R^n and R respectively, H is a set of bounded functions c : X → R, and l is any bounded loss function.

One of the early breakthroughs in statistical learning theory was the seminal work of Vapnik & Chervonenkis (1971) and the later work of Blumer et al. (1989), which characterized binary classification settings as learnable if and only if the VC dimension is finite. The VC dimension is the largest number m for which there exists a set of m examples such that every configuration of labels on it is consistent with one of the functions in H. Their analysis was based on the growth function,
τ_H(m) = max_{o = {x_1,...,x_m} ⊂ X} |{(c(x_1), ..., c(x_m)) : c ∈ H}|
A famous Lemma due to Sauer (1972) asserts that whenever the VC dimension of the hypothesis class H is finite, the growth function is polynomial in m,
τ_H(m) ≤ (e·m / vc(H))^{vc(H)}  when m > vc(H)    (1)

Theorem 1 (Vapnik & Chervonenkis (1971)). Let d be any distribution over an examples set Z, H a hypothesis class and l : H × Z → {0, 1} the 0-1 loss function. Then
E_{s∼d^m}[sup_{c∈H} |ɛ_d(c) - ɛ_s(c)|] ≤ (4 + √(log τ_H(2m))) / √(2m)
In particular, whenever the growth function is polynomial, the generalization risk and the empirical risk uniformly converge to each other.
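To make the quantities above concrete, the following short sketch computes the empirical risk under the 0-1 loss and performs ERM over a small, assumed hypothesis class of threshold functions; the class, the task and the sample size are illustrative assumptions only.

```python
# A small sketch of the quantities defined above for binary classification with
# the 0-1 loss: the empirical risk of a hypothesis and an ERM over a (finite,
# illustrative) hypothesis class of threshold functions on the real line.
# The class and the data-generating task below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)

def zero_one_loss(c, z):
    x, y = z
    return float(c(x) != y)

def empirical_risk(c, s):
    # eps_s(c) = (1/|s|) * sum_{z in s} l(c, z)
    return sum(zero_one_loss(c, z) for z in s) / len(s)

# A task d: x ~ Uniform[0, 1], labeled by the threshold concept 1[x >= 0.6].
m = 50
xs = rng.uniform(0.0, 1.0, size=m)
s = [(x, int(x >= 0.6)) for x in xs]

# A finite hypothesis class of thresholds (a stand-in for H).
H = [(t, (lambda x, t=t: int(x >= t))) for t in np.linspace(0.0, 1.0, 21)]

# ERM: the hypothesis with the smallest empirical risk on s.
best_t, best_c = min(H, key=lambda tc: empirical_risk(tc[1], s))
print("ERM threshold:", best_t, "empirical risk:", empirical_risk(best_c, s))
```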

PAC-Bayes setting. The PAC-Bayesian bound due to McAllester (1998) describes the Expected Generalization Risk (or simply the expected risk), i.e., the expectation of the generalization risk with respect to a distribution over the hypothesis class. The aim is not to measure the fit of each hypothesis directly, but to measure the fit of different distributions (perturbations) over the hypothesis class. The expected risk is measured by E_{c∼q}[ɛ_d(c)] and the Expected Empirical Risk is E_{c∼q}[ɛ_s(c)], where s = {z_1, ..., z_m} (satisfying E_{s∼d^m} E_{c∼q}[ɛ_s(c)] = E_{c∼q}[ɛ_d(c)]). The PAC-Bayes bound estimates the expected risk by the expected empirical risk and a penalty term that decreases as the size of the training data set grows. A prior distribution p, dictating a hierarchy between the hypotheses in H, is selected. The PAC-Bayesian bound penalizes the posterior selection of q by the relative entropy between q and p, measured by the Kullback-Leibler divergence.

Definition 2 (PAC-Bayes setting). A PAC-Bayes setting T_PB = (T, Q, p) is specified by: a learning setting T = (H, Z, l); a set Q of posterior distributions q over H; and a prior distribution p over H. The loss l is bounded in [0, 1].

There are many variations of the PAC-Bayesian bound, each with its own properties and advantages. In this work we refer to the original bound due to McAllester (1998).

Theorem 2 (McAllester (1998)). Let d be any distribution over an examples set Z, H a hypothesis class and l : H × Z → [0, 1] a loss function. Let p be a distribution over H and Q a family of distributions over H. Let δ ∈ (0, 1). Then
P_{s∼d^m}[ ∀q ∈ Q : E_{c∼q}[ɛ_d(c)] ≤ E_{c∼q}[ɛ_s(c)] + √((KL(q‖p) + log(m/δ)) / (2(m - 1))) ] ≥ 1 - δ
where KL(q‖p) = E_{c∼q}[log(q(c)/p(c))].

Deep learning. A neural network architecture (V, E, σ) is determined by a set of neurons V, a set of directed edges E and an activation function σ : R → R. In addition, a neural network of a certain architecture is specified by a weight function w : E → R. We denote by H_{V,E,σ} the hypothesis class consisting of all neural networks with architecture (V, E, σ). In this work we will only consider feedforward neural networks, i.e., those with no directed cycles. In such networks, the neurons are organized in disjoint layers, V_0, ..., V_N, such that V = ∪_{i=0}^N V_i. These functions have an output layer V_N consisting of only one neuron and an input layer V_0 holding the input and one constant neuron that always holds the value 1. The other layers are called hidden. A fully connected neural network is a neural network in which every neuron of layer V_i is connected to every neuron of layer V_{i+1}. The computation done in a feedforward neural network is as follows: each neuron takes the outputs (x_1, ..., x_h) of the neurons connected to it from the previous layer and the weights (w_1, ..., w_h) on the edges connecting them, and outputs σ(Σ_{i=1}^h w_i x_i), see Figure 1. The output of the entire network is the value produced by the output neuron, see Figure 2. In this paper we give special attention to the sign activation function, which returns -1 if the input is negative and 1 otherwise. The reason is that such neural networks are very expressive and are easier to analyse. Such networks define compound functions of half-spaces.

Before we move on to the sections dealing with general purpose transferability and the special case of deep learning, we would like to give some insights on our interpretation of common knowledge within neural networks. The classic approach to transfer learning in deep learning is done by shared weights. Concretely, some weights are shared between neural networks of similar architectures, each solving a different task. We adopt the notation
H^I_{V,E,σ} = {B_u | u : I → R}, s.t. B_u = {h_{V,E,σ,w} | ∀e ∈ I : w(e) = u(e)} and I ⊂ E
to denote a family of subclasses of the hypothesis class H_{V,E,σ}, each determined by a fixing of the weights on the edges in I ⊂ E. We will also denote by J the complement (i.e., I ∪ J = E and I ∩ J = ∅). This will be a cornerstone in formulating shared parameters between neural networks in transfer learning.

Figure 1: A neuron: four input values x_1, x_2, x_3, 1; weights w_1, w_2, w_3, w_4; and an activation function σ.

Figure 2: A neural network: a feedforward fully connected neural network with four input neurons and two hidden layers, each containing five neurons.

For each B_u, every two neural networks h_1, h_2 ∈ B_u share the same weights u on the edges in I ⊂ E, see Figure 3. In most practical cases the activation of a neuron is determined by activations from the previous layers through a set of edges that are either contained in I or do not intersect I. However, in this paper, for the PAC-Bayes setting of deep learning, the discussion is kept more general, and activations can be determined by both transferred weights and non-transferred weights. For VC-type bounds, the discussion is limited to the common situation in which the architecture is decomposed into two parts, the transfer architecture and the specific architecture, i.e.,
h_{V,E,σ,u⊕v} = h_2 ∘ h_1
where h_1 is a neural network consisting of the first j layers and the edges between them (with potentially more than one output) and h_2 takes h_1's output as input and produces the one output of the whole network. With the previous notions, this tends to be the case where I consists of all edges between the first j layers, see Figure 4. In this case, the family of hypothesis classes H^I_{V,E,σ} is viewed as a hypothesis class H_t (the transfer architecture) consisting of all transfer networks with the induced architecture. This hypothesis class consists of multiclass hypotheses with instance space X = R^{|V_0|} and output space Y = {-1, 1}^{|V_j|}. H_u serves as the specific architecture. Their composition gives back the neural networks in H_{V,E,σ}:
H_u ∘ H_t = {h_2 ∘ h_1 | h_2 ∈ H_u, h_1 ∈ H_t} = H_{V,E,σ}
Each hypothesis class B ∈ H^I_{V,E,σ} is now treated as a neural network h_B with M := |V_j| outputs, and we denote by h_B(·) its output.
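The decomposition h = h_2 ∘ h_1 into a frozen transfer part (weights on I) and a trainable specific part (weights on J) can be sketched as follows. This is a hypothetical toy, not the paper's construction: the layer widths, the random frozen weights, and the least-squares fit used for the specific part are all assumptions.

```python
# A sketch of the decomposition h = h2 ∘ h1 described above: I holds the frozen
# (transferred) weights of the first j layers, J holds the task-specific weights
# that remain free. Architecture sizes, the frozen values and the simple
# fitting step are illustrative assumptions, not the paper's experiments.
import numpy as np

rng = np.random.default_rng(2)

def h1(x, transfer_weights):
    """Transfer network: the first j layers, with the weights on I fixed."""
    h = x
    for W in transfer_weights:
        h = np.sign(h @ W)            # sign activations, as in the text
        h[h == 0] = 1.0
    return h

def h2(z, w_specific):
    """Specific network: maps h1's output to the single network output."""
    return np.sign(z @ w_specific)

# Frozen transfer weights (u on the edges of I), shared by every member of B_u.
transfer_weights = [rng.normal(size=(20, 16)), rng.normal(size=(16, 8))]

# Target-task data; only the specific weights (on J) are fit to it.
x = rng.normal(size=(100, 20))
y = np.sign(rng.normal(size=100))

reps = h1(x, transfer_weights)                          # representation = h1(x)
w_specific, *_ = np.linalg.lstsq(reps, y, rcond=None)   # fit h2 on top (stand-in)
train_err = np.mean(h2(reps, w_specific) != y)
print("training error of h2 ∘ h1 on the target task:", train_err)
```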

Figure 3: A visualization of H^I_{V,E,σ}: I is the set of all the blue edges. Red edges are not transferred. Each bias B_u ∈ H^I_{V,E,σ} is determined by a fixed vector u consisting of the weights on the edges in I. Note that some activations are fed by both blue and red edges.

Figure 4: A decomposition into transfer and specific networks: the blue edges constitute the transfer network and the red ones the specific network.

3 Problem setup

In Section 1 we introduced transfer learning as a multitask learning scenario with source tasks and a target task. The learner is provided with data sets from similar (yet different) tasks, and the goal is to come up with useful knowledge about the commonality between the tasks. That way, learning the target task would be easier, i.e., it would require smaller data sets. In transfer learning, there are an underlying problem and a transfer problem. The underlying learning problem is the setting of each individual learning problem; the transfer problem defines what is transferred during the learning process. We follow the formalism of Baxter (2000) with some modifications. In our study, the underlying setting will most of the time be a realizable binary classification/regression setting with an examples set Z = X × Y. The transfer setting T = (T, C, E) is specified by: a hypothesis class family C, which is a set of subsets of H (with no loss of generality, we will assume that H = ∪_{B∈C} B); an environment E, which is a set of tasks d; and an objective function g(B, d) = ɛ_d(B) := inf_{c∈B} ɛ_d(c), where typically B ∈ C.

The transfer learner has access to tasks {d_i}_{i=1}^k ∪ {d_t} (source and target) from E. One approach to transfer learning is to come up with B ∈ C that fits these tasks well. The class B is called a bias. Learning a target task d_t might require fewer training examples when B is learned successfully. In traditional machine learning, data points are sampled i.i.d according to a fixed distribution. In transfer learning, samples are generated by what we call a Factory. A factory is a process that provides multiple tasks. We suggest two major types of factories, Adversary Factories and Randomized Factories.

Figure 5: A factory: the sampling of examples z^i_j from the tasks {d_i} ∪ {d_t}. First, the tasks are selected either arbitrarily or randomly (depending on the factory type). The sample sets s_i = {z^i_j} are then drawn from the corresponding tasks.

The first type generates supervised tasks (i.e., distributions over Z = X × Y). It selects concepts {c_i} ∪ {c_t} almost arbitrarily, along with distributions over X. The other selects the tasks randomly, i.i.d from a distribution. In Section 4, we make use of adversary factories, while in Section 5 and Section 6 we use randomized factories instead. In both cases, Figure 5 demonstrates the process carried out by the factory in order to sample training data sets. The difference between the two types arises from the method used to select the tasks.

3.1 The adversary factory

A factory selects source tasks and a target task on which the learner is tested. An adversary factory is a type of factory that selects supervised tasks (i.e., distributions over Z = X × Y). It selects source concepts {c_i} that differ from the target concept c_t and are otherwise chosen arbitrarily. The factory also samples i.i.d distributions over X, {p_i} ∪ {p_t}, from the distribution of distributions D. By the supervised nature of the learning setting, we have E = H × E, where E is a set of distributions over X.

Definition 3 (Adversary factory). A factory D[k, m, n] is a process with parameters [k, m, n] that:
Step 1: Selects k + 1 tasks d_1, ..., d_k, d_t such that d_i = (p_i, c_i) ∈ E in the following manner: it samples i.i.d k + 1 distributions p_1, p_2, ..., p_k, p_t from a distribution of distributions D, and selects any k source concepts c_1, c_2, ..., c_k and one target concept c_t out of H such that c_i ≠ c_t for all i ∈ [k].
Step 2: Returns S = (s_[1,k], s_t) such that s_i ∼ d_i^m and s_t ∼ d_t^n.

Notation-wise, if n = 0 we will write D[k, m] instead of D[k, m, 0], and when m = n = 0 we will simply write D[k]. To avoid symbol overload, similar notions will be used to denote randomized factories, depending on the section. For a data set S sampled according to an adversary factory, we denote by O = (o_1, ..., o_k, o_t) the original data set without the labels. This data set is a sample according to o_i ∼ p_i^m and o_t ∼ p_t^n, where p_1, ..., p_k, p_t ∼ D^{k+1}. In Section 4, all factories are adversary, while in the following sections they are randomized.

A K-factory is a factory that selects all concepts from K ∈ C. K is said to be the subject of the factory. In this paper, we assume that all adversary factories are K-factories for some unknown bias K ∈ C. The symbol K will be reserved for the subject of a factory. We will often write, in Section 4 and the associated proofs, S ∼ D[k, m, n]. This is a slight abuse of notation, since the concepts are not sampled; it means that the claim is true for any selection of the concepts. In some sense, we can assume that there is an underlying unknown arbitrary selection of concepts, and the data is sampled with respect to them. To avoid overload of notation, we will write O ∼ D[k, m, n] to denote the corresponding unlabeled data set version of S.

The requirement that c_t differs from c_i for all i ∈ [k] is essential to this first model. Intuitively, the interesting cases are those in which the target concept was not encountered in the source tasks.

Formally, it is easy to handle transfer learning using any learning algorithm, by just ignoring the source data. In the other direction, if we allowed repeated use of the target concept, then any transfer algorithm could be used for conventional learning by repeatedly using the target data as the source. Thus, without the requirement, one cannot get meaningful transfer learning statements for adversary factories.

The knowledge that a set of tasks was selected from the same K ∈ C, which is the subject of a K-factory, is the main source of knowledge made available during transfer learning. The second type of information arising from transfer learning is that all k + 1 distributions were sampled from D. Using the face recognition example, there is the set of visual concepts B_{i,1} that captures the appearances of different furniture, and there is the set of visual concepts B_{i,2} that captures the characteristics of individual grasshoppers. From the source tasks, we infer that the target concept c_t belongs to a class of visual representations B_{i,3} that contains image classifiers appropriate for modeling individual human faces.

For concreteness, we present our running example. A disk in R^3 around 0 is a binary classifier with radius r and a hyperplane h, defined as follows:
f_{r,h}(x) = 1 if x ∈ h ∩ B(r), and 0 otherwise.
Here, B(r) is the ball of radius r around 0 in R^3: B(r) = {x ∈ R^3 : ‖x‖ ≤ r}. We define E_h = {f_{r,h} : r ≥ 0} and C = {E_h : h a hyperplane in R^3 around 0}. The following example demonstrates a specific K-factory on the hypothesis class defined above.

Example 1. We consider a K-factory as follows: 1. the disks are selected arbitrarily within the same K, such that the source concepts differ from the target concept; 2. D is supported on 3-D multivariate Gaussians with mean µ = 0 and covariance σ²I, where σ is sampled uniformly in [1, 5].

Compatibility between biases. In the adversary model, the source and target concepts are selected arbitrarily from a subject bias K ∈ C. Therefore, we expect the ability of a bias B to approximate K to be the worst case approximation of a concept from K using a concept from B. Next, we formalize this relation, which we call compatibility. We start with the mutual error rate, which is measured by ɛ_p(c_1, c_2) := ɛ_d(c_1), where d = (p, c_2). A bias B is highly compatible with the factory's subject K if for every selection of a target concept from K, there is a good candidate in B. The compatibility of a bias B with respect to the subject K, given an underlying distribution p over the instances set X, is therefore defined as follows:
Compatibility Error Rate: err_p(B, K) := sup_{c_2∈K} inf_{c_1∈B} ɛ_p(c_1, c_2)
In the adversary model, the distributions of the tasks are drawn from the same distribution of distributions. We therefore measure the generalization risk in the following manner:
Generalization Compatibility Error Rate: E(B, K) = E_{p∼D}[err_p(B, K)]
The empirical counterparts of these definitions are given, for an unlabeled dataset o drawn from the distribution p, as:
Empirical Compatibility Error Rate: err_o(B, K) = sup_{c_2∈K} inf_{c_1∈B} ɛ_o(c_1, c_2)
Empirical Mutual Error Rate: ɛ_o(c_1, c_2) := (1/|o|) Σ_{x∈o} l(c_2, (x, c_1(x)))
In order to estimate E(B, K), the average of multiple compatibility error rates is used. A set of unlabeled data sets O = (o_1, ..., o_k) is introduced, each corresponding to a different source task, and the empirical compatibility error corresponding to the source data is measured by:
Source Empirical Compatibility Error Rate: E_O(B, K) = (1/k) Σ_{i=1}^k err_{o_i}(B, K)
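The following sketch illustrates Example 1 and the empirical compatibility error rate err_o(B, K). It makes extra simplifying assumptions: the sup and inf over the infinite radius sets are replaced by finite grids, the instances are projected onto the hyperplane so the disks are not trivially empty on the sample, and the specific hyperplane and sample sizes are arbitrary choices.

```python
# A sketch, under illustrative simplifications, of Example 1 and of the
# empirical compatibility error rate err_o(B, K) defined above. K and B are
# finite grids of disk classifiers on a hyperplane through the origin; the
# sup/inf over the (infinite) radius sets is approximated over those grids.
import numpy as np

rng = np.random.default_rng(3)

def disk_classifier(normal, r, tol=1e-6):
    """f_{r,h}: 1 iff x lies (numerically) on the hyperplane h and ||x|| <= r."""
    def f(X):
        on_plane = np.abs(X @ normal) <= tol
        in_ball = np.linalg.norm(X, axis=1) <= r
        return (on_plane & in_ball).astype(int)
    return f

def empirical_mutual_error(c1, c2, o):
    # eps_o(c1, c2) under the 0-1 loss: fraction of o on which c1 and c2 disagree.
    return np.mean(c1(o) != c2(o))

def empirical_compatibility(B, K, o):
    # err_o(B, K) = sup over concepts in K of inf over concepts in B of eps_o(., .)
    return max(min(empirical_mutual_error(cb, ck, o) for cb in B) for ck in K)

# Draw from Example 1's distribution (3-D Gaussian, covariance sigma^2 * I) and,
# as an extra simplification so the disks see on-plane points, project onto h.
sigma = rng.uniform(1.0, 5.0)
o = rng.normal(scale=sigma, size=(200, 3))
o[:, 2] = 0.0                      # project onto the hyperplane z = 0

normal = np.array([0.0, 0.0, 1.0])
K = [disk_classifier(normal, r) for r in np.linspace(0.5, 5.0, 19)]  # subject
B = [disk_classifier(normal, r) for r in np.linspace(0.5, 5.0, 4)]   # coarser bias
print("empirical compatibility error err_o(B, K):", empirical_compatibility(B, K, o))
```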

3.2 The randomized factory

The randomized factory was presented, as matrix sampling, by Baxter (2000). In their learning-to-learn work, transfer learning is not considered, and we modify the formulation to include the target task d_t.

Definition 4 (Randomized factory). A randomized factory (or simply, factory, when the context is clear) is a process D[k, m, n] that:
Step 1: Samples i.i.d k + 1 tasks d_1, d_2, ..., d_k, d_t ∈ E from a distribution D.
Step 2: Returns S = (s_[1,k], s_t) such that s_i ∼ d_i^m and s_t ∼ d_t^n.

The probabilistic nature of the randomized factories allows them to fit a bias B by minimizing a suitable risk function. A natural choice of such a function is to measure the expected loss of B when approximating a task d with the following quantity, which we call the Transfer Generalization Risk:
ɛ_D(B) := E_{d∼D}[ɛ_d(B)]
A reasonable approach to transfer learning is to first learn a bias B (in C) that has a small transfer generalization risk. Since we typically have limited access to samples from the target task, we often employ the Source Generalization Risk instead:
ɛ_U(B) := (1/k) Σ_{i=1}^k ɛ_{d_i}(B), where U = {d_1, ..., d_k} ∼ D[k]
The definition of an adversary factory assumes that the concepts are selected arbitrarily, but without repetitions, from the same unknown hypothesis class K ∈ C. The randomized factory that samples the concepts according to some distribution with the restriction of zero probability for sampling the same concept twice could be considered a special case. Our randomized factory results do not assume this zero-probability criterion. Nevertheless, this is the usual situation and the one that is of the most interest.

3.3 Transferability

In this section, we provide general definitions of transfer learning. We follow classical learning theory: defining a PAC notion of transfer learning and VC-like dimensions. We then introduce a family of learning rules applicable to transfer learning. After describing the theory, we will turn to proving a fundamental Theorem stating that these are all equivalent to PAC-learnability.

Definition 5 (PAC-transfer). A transfer learning setting T = (T, C, E) is PAC-transferable if:
∃ algorithm A, ∀ɛ, δ ∃ k_0, m_0, n_0 (functions of ɛ, δ) ∀k > k_0, m > m_0, n > n_0, ∀D:
P_{S∼D[k,m,n]}[ɛ_{d_t}(A(S)) ≤ inf_{c∈H} ɛ_{d_t}(c) + ɛ] ≥ 1 - δ
where A(S) ∈ H is the output of the algorithm, and (k_0, m_0, n_0) are three functions of (ɛ, δ).

This model relies on the original PAC-learning model. In the classical PAC model, a learning algorithm is introduced. The algorithm samples enough labeled examples from an arbitrary distribution, labeled by a target concept. The output is a hypothesis that has a high probability of classifying correctly a new example (small error on the target task), with high confidence. In our framework, the idea is similar. In this case, the learner has access to examples from different tasks. The learner's hope is to come up with useful common knowledge, from the source tasks, for learning a new concept for the target task. The output is a hypothesis that has a small error on the target task. In this case, the factory (chosen arbitrarily) provides the data samples. The main assumption is that the distributions from which the examples are selected are sampled i.i.d from the same distribution of distributions. In many cases, we will add a realizability assumption that all concepts share the same representation (i.e., lie in the same K ∈ C). As we already mentioned, this is the case when dealing with adversary factories. It provides common knowledge between the concepts. The probabilistic assumption enables the algorithm to transfer that useful information.

Next, we define VC-like dimensions for factory-based transfer learning. Unlike conventional VC dimensions, which are purely combinatorial, the suggested dimensions are algorithmic and probabilistic. This is because the post-transfer learning problem relies on information gained from the source samples.

Definition 6 (Transfer VC dimension). T = (T, C, E) has transfer VC dimension ≤ vc if:
∃ algorithm N, ∀ɛ, δ ∃ k_0, m_0 (functions of ɛ, δ) ∀k > k_0, m > m_0, ∀D:
P_{S∼D[k,m]}[vc(N(S)) ≤ vc and inf_{c∈N(S)} ɛ_{d_t}(c) ≤ inf_{c∈H} ɛ_{d_t}(c) + ɛ] ≥ 1 - δ
Here, N(S) ∈ C is a hypothesis class. We say that the transfer VC dimension is exactly vc if the above expression does not hold with vc replaced by vc - 1. The algorithm N is called a narrowing.

These algorithms are special examples of how the common knowledge might be extracted. In the first stage the algorithm, which is provided with source data, returns a narrow hypothesis class N(S) that with high probability approximates d_t very well. N(S) can be viewed as a learned representation of the tasks from D. Post-transfer, learning takes place in N(S), where there exists a hypothesis that is ɛ-close to the best approximation possible in H. In different situations, we will assume realizability, i.e., that there exists B ∈ C such that D is supported by tasks d that satisfy inf_{c∈B} ɛ_d(c) = 0. In the adversary case, each factory is a K-factory and, in particular, realizable.

In the face recognition example, the deep learning algorithm has access to face images of multiple humans. From this set of images, a representation of an image of human faces is learned. Next, given the representation of human faces, the learner selects a concept that best fits the target data in order to learn the specified human face.

By virtue of general VC theory, for learning a target task with enough target examples (w.r.t. the capacity of N(S) instead of H's capacity), one is able to output a hypothesis that is (ɛ + ɛ')-close to the best approximation in H, where ɛ' is the accuracy parameter of the post-transfer learning algorithm. We therefore define a 2-step program. The first step applies narrowing and replaces H by a simplified hypothesis class B = N(s_[1,k]). The second step learns the target concept within B. An immediate special case, the T-ERM learning rule (transfer empirical risk minimization), uses an ERM rule as its second step. Put differently:

Input: S = (s_[1,k], s_t).
Output: a concept c_out such that ɛ_{d_t}(c_out) ≤ ɛ with probability ≥ 1 - δ.
Narrowing: narrow the hypothesis class, H → B := N(s_[1,k]);
Output c_out = ERM_B(s_t);
Algorithm 1: The T-ERM learning rule

In the following sections, when needed, we will denote by n_2step (a function of ɛ, δ) the minimal target sample complexity of 2-step programs. We claim that whenever T is a learnable binary classification learning setting, once the narrowing is performed, the ERM step is possible with a number of samples that depends only on ɛ, δ. For this purpose, the following Lemma is useful:

Lemma 1. The sample complexity of any learnable binary classification hypothesis class H of VC dimension u is bounded by a universal function N_u(ɛ, δ), i.e., it depends only on the VC dimension.

Proof. Simply by Theorem 1.

Based on Lemma 1, with k, m large enough (sufficient to apply the narrowing with error and confidence parameters ɛ/2, δ/2) and n = N_u(ɛ/2, δ/2) from Lemma 1, the T-ERM rule returns a concept that has error ≤ ɛ with probability ≥ 1 - δ.
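The 2-step T-ERM rule can be sketched on a toy transfer setting as follows. The hypothesis class family, in which each bias shares a projection direction and members differ only in a threshold, is an assumption made for illustration, and the narrowing step is implemented as a simple search over candidate directions (in the spirit of the C-ERM rule discussed in Section 5); none of this reproduces the paper's constructions.

```python
# A sketch of the T-ERM learning rule (Algorithm 1) on a toy transfer setting.
# Each bias B_w contains the classifiers x -> 1[w·x >= t] for a shared direction
# w and a grid of thresholds t, so tasks share the direction and differ in the
# threshold. Narrowing is a search over a finite set of candidate directions.
import numpy as np

rng = np.random.default_rng(4)

thresholds = np.linspace(-2.0, 2.0, 41)

def empirical_risk(w, t, x, y):
    return np.mean((x @ w >= t).astype(int) != y)

def erm_in_bias(w, x, y):
    # ERM_B(s): the best threshold inside the bias B_w for the data set (x, y).
    errs = [empirical_risk(w, t, x, y) for t in thresholds]
    return thresholds[int(np.argmin(errs))], min(errs)

def narrowing(source_sets, candidate_ws):
    # N(s_[1,k]): pick the bias whose best-fitting member has the smallest
    # average empirical risk over the k source data sets.
    def avg_err(w):
        return np.mean([erm_in_bias(w, x, y)[1] for x, y in source_sets])
    return min(candidate_ws, key=avg_err)

# Factory-style data: all tasks use the true direction w*, with task-specific
# thresholds; k source tasks with m samples each, one small target data set.
w_star = np.array([np.cos(0.7), np.sin(0.7)])
def sample_task(m, t):
    x = rng.normal(size=(m, 2))
    return x, (x @ w_star >= t).astype(int)

source_sets = [sample_task(m=50, t=rng.uniform(-1, 1)) for _ in range(8)]
x_t, y_t = sample_task(m=15, t=0.3)

candidate_ws = [np.array([np.cos(a), np.sin(a)]) for a in np.linspace(0, np.pi, 18)]
B_w = narrowing(source_sets, candidate_ws)    # step 1: narrow H to a bias
t_out, _ = erm_in_bias(B_w, x_t, y_t)         # step 2: ERM inside the bias
print("learned direction:", B_w, "learned threshold:", t_out)
```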

4 Results in the adversary model

In this section, we make use of adversary factories in order to present the equivalence between the different definitions of transferability discussed above for the binary case. Furthermore, it is shown that in this case, transferability is equivalent to PAC-learnability. In the next section, we study the advantages of transfer learning that exist despite this equivalence.

4.1 Transferability vs. learnability

The binary classification case.

Theorem 3. Let T = (T, C, E) be a binary classification transfer learning setting. The following conditions on T are then equivalent: 1. T has a finite transfer dimension. 2. T is PAC-transferable. 3. T is PAC-learnable.

Next, we provide bounds on the target sample complexity of 2-step programs.

Corollary 1 (Quantitative results). When T = (T, C, E) is a binary classification transfer learning setting that has transfer VC dimension vc, then the following holds:
C_1 · (vc + log(1/δ))/ɛ ≤ n_2step(ɛ, δ) ≤ C_2 · (vc + log(1/δ))/ɛ²
for some constants C_1, C_2 > 0.

Proof. This corollary follows immediately from the characterization above and is based on Blumer et al. (1989). We apply narrowing in order to narrow the class to VC dimension vc. The second step learns the hypothesis in the narrowed subclass. The upper bound follows when the narrowed subclass differs from K (the unrealizable case), while the lower bound holds when K equals the narrowed subclass (the realizable case).

The regression case. We demonstrate that transferability does not imply PAC-learnability in regression problems. PAC-learning and PAC-transferability are well defined for regression using appropriate losses. As the following Lemma shows, in the regression case, there is no simple equivalence between PAC-transferability and learnability.

Lemma 2. There is a transfer learning setting T = (T, C, E) that is PAC-transferable but not PAC-learnable with the squared loss l.

While the example in the proof of Lemma 2 (Appendix) is seemingly pathological, the scenario of non-learnability in regression is common, for example, due to colinearity that gives rise to ill-conditioned learning problems. Having the ability to learn from source tasks reduces the ambiguity.

4.2 Trivial and non-trivial transfer learning

Transfer learning would be beneficial if it reduces the required target sample complexities. We call this the non-trivial transfer property. It can also be said that a transfer learning setting T = (T, C, E) is non-trivially transferable if there is a transfer learning algorithm for it with a target sample complexity smaller than the sample complexity of any learning algorithm of H by a factor 0 < c < 1. An alternative definition (that is not equivalent) is to say that the transfer VC and regular VC dimensions differ. We next describe a pathological case in which transfer is trivial, and then demonstrate the existence of non-trivial transfer.

The pathological case can be demonstrated in the following simple example. Let H be the set of all 2D disks in R^3 around 0. Each E_h contains the disks on the same hyperplane h, for a finite collection of h. Consider the factory D that samples distributions d supported only by points from

the hyperplanes h with a distance of at least 1 from the origin. Since, in our model, the concepts are selected arbitrarily, consider the case where all source concepts are disks with a radius smaller than 1, and the target concept has a radius of 2. In the source data, all examples are negative, and no information is gained on the hyperplane h. Despite the existence of the pathological case above, the following Lemma claims the existence of non-trivial transferability.

Lemma 3. There exists a binary classification transfer learning setting T = (T, C, E) (i.e., T is a binary classification setting) that is non-trivially transferable.

4.3 Generalization bounds for adversary transfer learning

In the previous section, we investigated the relationship between learnability and transferability. It was demonstrated that, in some cases, there is non-trivial transferability. In such cases, transfer learning is beneficial and helps to reduce the size of the target data. In this section, we extend the discussion on non-trivial transferability. We focus on generalization bounds for transfer learning in the adversary model. Two bounds are presented. The first bound is a VC-style bound. The second bound combines both PAC-Bayesian and VC perspectives. The proposed bounds will shed some light on representation learning and transfer learning in general. Nevertheless, despite the wide applicability of these generalization bounds, it will not be trivial to derive a transfer learning algorithm from them, since they measure the difference between generalization and empirical compatibility error rates. In general, computing the empirical compatibility error rate requires knowledge about the subject of the factory, which is kept unknown. Therefore, without additional assumptions it is intractable to compute this quantity.

VC-style bounds for adversary transfer learning. We extend the original VC generalization bound to the case of adversary transfer learning. We call the presented bound the min-max transfer learning bound. In this context, the min-max stands for the competition between the difficulty of approximating K and the ability of a bias B to approximate it. This bound estimates the expected worst case difference between the generalization compatibility of B to K and the empirical source compatibility of B and K. The upper bound is the sum of two regularization terms. The first penalizes the complexities of both B and K with respect to the number of samples per task, m. The second penalizes the complexity of C with respect to k.

The first step towards constructing a VC bound in the adversary model is defining a growth function specialized for this setting. The motivation is controlling the compatibility between a bias B and the subject K. Throughout the construction of compatibility measurements, the most elementary unit is the empirical error of B, min_{c_1∈B} ɛ_o(c_1, c_2) for some c_2 ∈ K along with an unlabeled data set o. Instead of dealing with the whole bias B, we can focus only on c_1 = ERM_B(c_2(o)). In that way we can control the compatibility of B with K on the data set o. In transfer learning, we wish to control the joint error, put differently, the average of multiple compatibility errors on different data sets. For this purpose, we count the number of different configurations of two concepts c_{i,1} and c_{i,2} on k unlabeled data sets o_i such that c_{i,1} = ERM_B(c_{i,2}(o_i)). To avoid notational overload, we assume that the ERM is fixed, i.e., we assume an inner implementation of an ERM rule that takes a data set and returns a hypothesis for any selected bias B. Nevertheless, we do not restrict how it is implemented. More formally, ERM_B(s) represents a specific function that takes B, s and returns a hypothesis in B. Based on this background, we denote the following set of configurations:
[H, C, K]_O = {(c_{1,1}(o_1), c_{1,2}(o_1), ..., c_{k,1}(o_k), c_{k,2}(o_k)) : c_{i,2} ∈ K and c_{i,1} = ERM_B(c_{i,2}(o_i)) s.t. B ∈ C}
In addition, the Adversarial Transfer Growth Function τ(k, m; C, K):
τ(k, m; C, K) = max_{O⊂X^{k×m}} |[H, C, K]_O|
This quantity represents the worst case number of optional configurations.

Theorem 4 (The min-max transfer learning bound). Let T = (T, C, E) be a binary classification transfer learning setting. Then, for every D and every K ∈ C:
E_{O∼D[k,m]}[sup_{B∈C} |E(B, K) - E_O(B, K)|] ≤ (4 + √(log τ(2k, m; C, K)))/√(2k) + (√(log(sup_{B∈C} τ_B(2m))) + √(log τ_K(2m)))/√(2m)

PAC-Bayes bounds for adversary transfer learning. This bound combines the PAC-Bayesian and VC perspectives. We call it the perturbed min-max transfer learning bound. This is because there is still a competition between the ability of the bias to approximate and the difficulty of the subject; nevertheless, in this case the bias is perturbed. We take a statistical relaxation of the standard model. A set of posterior distributions Q and a prior distribution P, both over C, are taken. Extending the discussion in Section 2, the aim is to select Q ∈ Q that best fits the data, instead of a concrete bias. In this setting, we measure an expected version of the generalization compatibility error rate with B distributed by Q ∈ Q. We call it the Expected Generalization Compatibility Error Rate. Formally,
E(Q, K) = E_{B∼Q} E_{p∼D}[err_p(B, K)]
It is important to note that the left hand side of the bound is, in general, intractable to compute. This is due to its direct dependence on K, which is unknown. Nevertheless, there still might be conditions in which different learning methods do minimize this argument. In addition, it gives insights on what a good bias is.

Theorem 5 (The perturbed min-max transfer learning bound). Let T = (T, C, E) be a binary classification transfer learning setting. In addition, let P be a prior distribution and Q a family of posterior distributions, both over C. Let δ ∈ (0, 1) and λ > 0. Then for every factory D, with probability ≥ 1 - δ over the selection of O ∼ D[k, m],
∀Q ∈ Q, ∀K ∈ C: E(Q, K) ≤ (1/k) Σ_{i=1}^k E_{B∼Q}[err_{o_i}(B, K)] + 2√((log(τ_H(2m)) + log(8/λδ))/(2m)) + √((KL(Q‖P) + log(2/δ))/(2(k - 1))) + λδ
with the restriction that m ≥ 8 log(2/δ)/(λδ)².

5 Results in the randomized model

We start our discussion of randomized factories with the following Lemma, which revisits the disks in R^3 example above for this case.

Lemma 4. Let T = (T, C, E) be a realizable transfer learning setting such that H is the set of all 2D disks in R^3 around 0, each E_h is the set of disks on the same hyperplane h, and C = {E_h : h a hyperplane in R^3 around 0}. This hypothesis class has transfer VC dimension = 1 (and regular VC dimension = 2).

5.1 Transferability vs. learnability

A transferring rule or bias learner N is a function that maps source data (i.e., s_[1,k]) into a bias B. An interesting special case is the simplifier. A simplifier fits B ∈ C that has a relatively small error rate.

Definition 7 (Simplifier). Let T = (T, C, E) be a transfer learning setting. An algorithm N with access to C is called a simplifier if:
∀ɛ, δ ∃ k_0, m_0 (functions of ɛ, δ) ∀k > k_0, m > m_0, ∀D:
P_S[ɛ_D(N(S)) ≤ inf_{B∈C} ɛ_D(B) + ɛ] ≥ 1 - δ

Here, the source data S is sampled according to D[k, m]. In addition, N(S) ∈ C, which is a hypothesis class, is the result of applying the algorithm to the source data. The quantities k_0, m_0 are functions of ɛ, δ.

The standard ERM rule is next extended into a transferring rule. This rule returns a bias B that has the minimum error rate on the data, measured for each data set separately. This transferring rule, called C-ERM_C(S), is defined as follows:
C-ERM_C(S) := argmin_{B∈C} (1/k) Σ_{i=1}^k ɛ_{s_i}(c_{i,B}), s.t. c_{i,B} = ERM_B(s_i)
This transferring rule was previously considered by Ando et al. (2005), who named it Joint ERM.

Uniform convergence (Shalev-Shwartz et al. (2010)) is defined for every hypothesis class H (w.r.t. the loss l) in the usual manner:
∀d: P_{s∼d^m}[∀c ∈ H : |ɛ_d(c) - ɛ_s(c)| ≤ ɛ] ≥ 1 - δ
It can also be defined for C (w.r.t. the loss g) as:
∀D: P_{U∼D[k]}[∀B ∈ C : |ɛ_D(B) - ɛ_U(B)| ≤ ɛ] ≥ 1 - δ
for all sample sizes larger than some function of ɛ and δ. The following lemma states that whenever both H and C have uniform convergence properties, the C-ERM_C transferring rule is a simplifier for H.

Lemma 5. Let T = (T, C, E) be a transfer learning setting. If both C and H have uniform convergence properties, then the C-ERM rule is a simplifier of H.

The preceding Lemma explained that C-ERM_C transferring rules are helpful for transferring knowledge efficiently. The next Theorem states that even if the hypothesis class H has an infinite VC dimension, there still might be a simplifier outputting hypothesis classes with a finite VC dimension. This result, however, could not be obtained when restricting the size of the data to be bounded by some function of ɛ, δ. Therefore, in a sense, there is transferability beyond learnability.

Theorem 6. The following statements hold for binary classification:
1. There is a binary classification transfer learning setting T = (T, C, E) such that H has an infinite VC dimension, yet T has a simplifier N that always outputs a finite VC dimensional bias B.
2. If a binary classification transfer learning setting T = (T, C, E) has an infinite VC dimension, then sup_{B∈C} vc(B) = ∞.

5.2 Generalization bounds for randomized transfer learning

In the previous section, we explained that whenever both H and C have uniform convergence properties, there exists a simplifier for this transfer learning setting. We explained that, in this case, a C-ERM rule is an appropriate simplifier. In this section, we widen the discussion on the existence of a simplifier for different cases. For this purpose, we extend famous generalization bounds from statistical learning theory to the case of transfer learning.

VC-style bounds for transfer learning. We begin with an extension of the original VC bound (see Section 2) to the case of transfer learning. It upper bounds the expected (w.r.t. random source data) difference between the transfer generalization risk of B and the 2-step Source Empirical Risk working on B. A mapping r : B → r_B from a bias B to a learning rule of the bias (i.e., one that outputs hypotheses in B with empirical error that converges to inf_{c∈B} ɛ_d(c)) is called a post-transfer learning rule/algorithm (see Equation 2). Informally, the 2-step source empirical risk measures the empirical success rate of a post-transfer learning rule on a few data sets. The bounding quantity depends on the ability of the post-transfer learning rule to generalize. The 2-step source empirical risk is formally defined as follows:
ɛ_S(B, r) := (1/k) Σ_{i=1}^k ɛ_{s_i}(r_B(s_i)), s.t. S = s_[1,k]
Next, the standard constructions of the original VC bound are extended:
[H, C, r]_S = {(r_{1,B}(s_1), ..., r_{k,B}(s_k)) : B ∈ C}, s.t. S = s_[1,k]
Here, r_{i,B} := r_B(s_i) denotes the application of the learning rule r_B on s_i, and r_{i,B}(s_i) the realization of r_{i,B} on s_i. The equivalent of the standard growth function in transfer learning is the transfer growth function:
τ(k, m, r) = max_S |[H, C, r]_S|
With this formalism, we can state our extended version of the VC bound.

Theorem 7 (Transfer learning bound 1). Let T = (T, C, E) be a binary classification transfer learning setting such that T is learnable. In addition, assume that r is a post-transfer learning rule, i.e., endowed with the following property:
∀d, B: r_B(·) ∈ B and E_{s∼d^m}[|inf_{c∈B} ɛ_d(c) - ɛ_s(r_B(s))|] ≤ ɛ(m) → 0    (2)
Then,
E_{S∼D[k,m]}[sup_{B∈C} |ɛ_D(B) - ɛ_S(B, r)|] ≤ (4 + √(log τ(2k, m, r)))/√(2k) + ɛ(m)
We conclude that in binary classification, if T is learnable and r satisfies Equation 2, then there exists a simplifier whenever the transfer growth function τ(k, m, r) is polynomial in k.

PAC-Bayes bounds for transfer learning. We provide two different PAC-Bayes bounds for transfer learning. The first bound estimates the gap between the generalization transfer risk of each B ∈ C and the average of the empirical risks of c_{i,B} in the binary classification case. On the other hand, Theorem 9 will address a more general case, when H might have an infinite VC dimension or the underlying learning setting is not binary classification. The first approach concentrates on model selection within PAC-Bayesian bounds. It presents a bound for model selection that combines PAC-Bayes and VC bounds. We construct a generalization bound to measure the fit of a random representation, i.e., the motivation is searching for Q that minimizes
R(Q) = E_{B∼Q} E_{d∼D}[inf_{c∈B} ɛ_d(c)]

Theorem 8 (Transfer learning bound 2). Let T = (T, C, E) be a binary classification transfer learning setting. In addition, let P be a prior distribution and Q a family of posterior distributions, both over C. Let δ ∈ (0, 1) and λ > 0. Then with probability ≥ 1 - δ over S,
∀Q ∈ Q: R(Q) ≤ (1/k) Σ_{i=1}^k E_{B∼Q}[ɛ_{s_i}(c_{i,B})] + 2√((log(τ_H(2m)) + log(8/λδ))/(2m)) + √((KL(Q‖P) + log(2/δ))/(2(k - 1))) + λδ
with the restriction that m ≥ 8 log(2/δ)/(λδ)².
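To make the two sides of Theorem 7 tangible, the sketch below estimates, on an assumed toy setting of 1-D threshold concepts, the 2-step source empirical risk ɛ_S(B, r) with r_B = ERM_B and a Monte Carlo approximation of the transfer generalization risk ɛ_D(B). It only illustrates the quantities being compared; it does not verify the bound or its constants.

```python
# A self-contained sketch, on an assumed toy setting, of the two quantities that
# Theorem 7 relates: the transfer generalization risk eps_D(B) and the 2-step
# source empirical risk eps_S(B, r) with r_B taken to be ERM inside B. The task
# distribution, the bias B (thresholds on a fixed grid) and all sample sizes
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)

grid = np.linspace(-2.0, 2.0, 81)        # the bias B: thresholds on this grid

def risk(t_hyp, t_true, xs):
    return np.mean((xs >= t_hyp).astype(int) != (xs >= t_true).astype(int))

def erm(xs, ys):
    errs = [np.mean((xs >= t).astype(int) != ys) for t in grid]
    return grid[int(np.argmin(errs))]

def sample_task():
    return rng.uniform(-1.0, 1.0)         # D: a task is a random true threshold

# 2-step source empirical risk eps_S(B, r): average empirical risk of ERM_B on
# each of the k source data sets (m samples each).
k, m = 10, 30
emp = []
for _ in range(k):
    t_true = sample_task()
    xs = rng.normal(size=m)
    ys = (xs >= t_true).astype(int)
    t_hat = erm(xs, ys)
    emp.append(np.mean((xs >= t_hat).astype(int) != ys))
eps_S = float(np.mean(emp))

# Monte Carlo estimate of eps_D(B) = E_{d~D}[ inf_{c in B} eps_d(c) ], using a
# large fresh sample per task to approximate the per-task infimum risk.
per_task = []
for _ in range(200):
    t_true = sample_task()
    xs = rng.normal(size=2000)
    per_task.append(min(risk(t, t_true, xs) for t in grid))
eps_D = float(np.mean(per_task))

print("2-step source empirical risk:", eps_S, " transfer generalization risk ~", eps_D)
```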


More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

STOPPING SIMULATED PATHS EARLY

STOPPING SIMULATED PATHS EARLY Proceedings of the 2 Winter Siulation Conference B.A.Peters,J.S.Sith,D.J.Medeiros,andM.W.Rohrer,eds. STOPPING SIMULATED PATHS EARLY Paul Glasseran Graduate School of Business Colubia University New Yor,

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval

Uniform Approximation and Bernstein Polynomials with Coefficients in the Unit Interval Unifor Approxiation and Bernstein Polynoials with Coefficients in the Unit Interval Weiang Qian and Marc D. Riedel Electrical and Coputer Engineering, University of Minnesota 200 Union St. S.E. Minneapolis,

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

Physics 215 Winter The Density Matrix

Physics 215 Winter The Density Matrix Physics 215 Winter 2018 The Density Matrix The quantu space of states is a Hilbert space H. Any state vector ψ H is a pure state. Since any linear cobination of eleents of H are also an eleent of H, it

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

The Wilson Model of Cortical Neurons Richard B. Wells

The Wilson Model of Cortical Neurons Richard B. Wells The Wilson Model of Cortical Neurons Richard B. Wells I. Refineents on the odgkin-uxley Model The years since odgkin s and uxley s pioneering work have produced a nuber of derivative odgkin-uxley-like

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

CS Lecture 13. More Maximum Likelihood

CS Lecture 13. More Maximum Likelihood CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood

More information

Supervised assessment: Modelling and problem-solving task

Supervised assessment: Modelling and problem-solving task Matheatics C 2008 Saple assessent instruent and indicative student response Supervised assessent: Modelling and proble-solving tas This saple is intended to infor the design of assessent instruents in

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Hybrid System Identification: An SDP Approach

Hybrid System Identification: An SDP Approach 49th IEEE Conference on Decision and Control Deceber 15-17, 2010 Hilton Atlanta Hotel, Atlanta, GA, USA Hybrid Syste Identification: An SDP Approach C Feng, C M Lagoa, N Ozay and M Sznaier Abstract The

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Reduced Length Checking Sequences

Reduced Length Checking Sequences Reduced Length Checing Sequences Robert M. Hierons 1 and Hasan Ural 2 1 Departent of Inforation Systes and Coputing, Brunel University, Middlesex, UB8 3PH, United Kingdo 2 School of Inforation echnology

More information

Least Squares Fitting of Data

Least Squares Fitting of Data Least Squares Fitting of Data David Eberly, Geoetric Tools, Redond WA 98052 https://www.geoetrictools.co/ This work is licensed under the Creative Coons Attribution 4.0 International License. To view a

More information

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA) Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu

More information

Statistics and Probability Letters

Statistics and Probability Letters Statistics and Probability Letters 79 2009 223 233 Contents lists available at ScienceDirect Statistics and Probability Letters journal hoepage: www.elsevier.co/locate/stapro A CLT for a one-diensional

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

arxiv: v2 [math.co] 3 Dec 2008

arxiv: v2 [math.co] 3 Dec 2008 arxiv:0805.2814v2 [ath.co] 3 Dec 2008 Connectivity of the Unifor Rando Intersection Graph Sion R. Blacburn and Stefanie Gere Departent of Matheatics Royal Holloway, University of London Egha, Surrey TW20

More information

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions

Tight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

ZISC Neural Network Base Indicator for Classification Complexity Estimation

ZISC Neural Network Base Indicator for Classification Complexity Estimation ZISC Neural Network Base Indicator for Classification Coplexity Estiation Ivan Budnyk, Abdennasser Сhebira and Kurosh Madani Iages, Signals and Intelligent Systes Laboratory (LISSI / EA 3956) PARIS XII

More information

arxiv: v1 [math.nt] 14 Sep 2014

arxiv: v1 [math.nt] 14 Sep 2014 ROTATION REMAINDERS P. JAMESON GRABER, WASHINGTON AND LEE UNIVERSITY 08 arxiv:1409.411v1 [ath.nt] 14 Sep 014 Abstract. We study properties of an array of nubers, called the triangle, in which each row

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel

Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel 1 Iterative Decoding of LDPC Codes over the q-ary Partial Erasure Channel Rai Cohen, Graduate Student eber, IEEE, and Yuval Cassuto, Senior eber, IEEE arxiv:1510.05311v2 [cs.it] 24 ay 2016 Abstract In

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information