Twisting Sample Observations with Population Properties to Learn

B. APOLLONI, S. BASSIS, S. GAITO and D. MALCHIODI
Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
Via Comelico 39, Milano, ITALY

Abstract: We introduce a theoretical framework for Probably Approximately Correct (PAC) learning. It enables us to compute the distribution law of the random variable representing the probability of the region where the hypothesis is incorrect. The distinguishing feature with respect to the inference of an analogous probability from a Bernoulli variable is the dependence of this distribution on a complexity parameter playing a companion role to the Vapnik-Chervonenkis dimension.

Key-Words: Computational learning, statistical inference, twisting argument.

1 Introduction

A very innovative aspect of PAC learning [7] is to assume that probabilities are random variables per se. This is not the point of confidence intervals in classical statistical theory, where randomness is due to the extremes of the intervals rather than to the value of the probabilistic parameter at hand. For instance, let us consider a Bernoulli variable X and the inequality

$\min_T \{\, T : P(\theta \le T) \ge 1 - \delta \,\}$    (1)

identifying a 1-δ confidence interval for the parameter θ = P(X = 1). Here T is a function f of a random sample (X_1, ..., X_m), the minimum is taken over some free parameters of f, and the randomness of the event in the brackets derives only from the randomness of the sample.

In this statistical framework typical PAC learning inequalities [8] hide some subtle aspects. With respect to a class C of Boolean functions {g_α} and two elements of it representing the target concept c and its approximating hypothesis h, consider the event

$B = \left( \sup_{\alpha \in \Delta} \left( R(\alpha) - \nu(\alpha) \right) < \varepsilon \right)$    (2)

for a given ε, where Δ is a set of reals, α indexes candidate h's within C, R(α) is the probability measure of the related symmetric difference c △ h, and ν(α) is the sample frequency of falling in this domain. If we fix ν(α) = 0, the probabilistic features of this event come from the randomness of the set Δ collecting hypotheses whose symmetric difference with c contains no points of a random sample. However, if in order to bound from above the probability of this event we enlarge Δ to the whole set of indices of C, we refer to an event that has no meaning in classical statistical theory. On the contrary, it does make sense to require

$P(B \mid \nu(\alpha) = 0) = P\!\left( \sup_{\alpha \in \Delta} R(\alpha) < \varepsilon \right) = P(R(\alpha^*) < \varepsilon) > 1 - \delta$    (3)

for some α* ∈ Δ, if we assume the probability measure of the error domain related to α* to be, in turn, a random variable.

Right from the start, the object of our inference is a string of data X (possibly of infinite length) that we partition into a prefix we assume to be known at present (and therefore call sample) and a suffix of unknown future data we call a population (see Fig. 1). All these data share the feature of being independent observations of the same phenomenon. Therefore, without loss of generality, we assume these data to be the output of some function g_θ having input from a set of independent random variables U uniformly distributed in the unit interval.

Fig. 1. Generating a sample of Bernoulli variables. x-axis: index of the U realizations; y-axis: both U (bars) and X (bullets) values. The threshold line p realizes a mapping from U to X through (4).
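Before formalizing the mechanism, it may help to see the construction of Fig. 1 in code. The sketch below is ours, not from the paper (NumPy assumed, with illustrative sizes and threshold); it only mirrors the picture: uniform seeds, a threshold, and the split into a sample prefix and a population suffix.

```python
import numpy as np

# A minimal sketch of the sampling mechanism of Fig. 1: independent uniform seeds
# are turned into Bernoulli observations by a threshold p, and the resulting string
# is split into a known prefix (sample) and an unknown suffix (population).
rng = np.random.default_rng(42)
m, n, p = 20, 200, 0.4        # sample size, population size, threshold (all illustrative)

u = rng.random(m + n)         # the uniform seeds U
x = (u <= p).astype(int)      # the explained Bernoulli string X

sample, population = x[:m], x[m:]
print("sample frequency of ones:    ", sample.mean())
print("population frequency of ones:", population.mean())
```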

By default, capital letters (such as U, X) will denote random variables and small letters (u, x) their corresponding realizations. We will refer to M = (U, g_θ) as a sampling mechanism and to g_θ as an explaining function; this function is precisely the object of our inference. Consider, for instance, the sampling mechanism M = (U, g_p), where

$g_p(u) = \begin{cases} 1 & \text{if } u \le p \\ 0 & \text{otherwise} \end{cases}$    (4)

explains a sample and population distributed according to a Bernoulli law of mean p. As shown in Fig. 1, for a given sequence of U's we obtain different binary strings depending on the height of the threshold line corresponding to p. Thus it is easy to derive the following implication chain

$(k_{\tilde p} \ge k) \Leftarrow (p < \tilde p) \Leftarrow (k_{\tilde p} \ge k+1)$    (5)

and the consequent bound on the probability

$P(K_{\tilde p} \ge k) \ \ge\ P(P < \tilde p) = F_P(\tilde p) \ \ge\ P(K_{\tilde p} \ge k+1)$    (6)

which characterizes the cumulative distribution function (c.d.f.) F_P of the parameter P, representing the asymptotic frequency of 1 in the population compatible with the number k of 1's in the sample. Here k denotes the number of 1's in the sample and K_p̃ denotes the random variable counting the number of 1's in the sample when the threshold in the explaining function switches to p̃ for the same realizations of U. With reference to the probability space (Ω, Σ, P), where Ω is the [0, 1] interval, Σ the related sigma-algebra and P is uniformly distributed on Ω, the P in (6) refers to the product space of the sample and population of U's in Ω; the sample and population of the Bernoulli variable are just a function of them. The probabilities regarding the statistic come from the marginalization of the joint distribution with respect to the population, while the distribution of the parameter comes from the marginalization with respect to the sample.

Note the asymmetry in the implications. It derives from the fact that raising the threshold parameter in g_p cannot decrease the number of 1's in the observed sample, but we can recognize that such a raising occurred only if we really see a number of 1's in the sample greater than k. We will refer to every expression similar to (5) as a twisting argument [4], since it allows us to exchange events on parameters with events on statistics. Its peculiarity lies in the fact that, since the first and last probability in (6) are completely known, we are able to identify the distribution law of the unknown parameter. A more thorough discussion of this tool and of the related algorithmic inference framework can be found in [4].

The principal use we will make of relations like (6) is to compute confidence intervals for the unknown parameter P. In the above inference framework it does not make sense to assume a specific random variable as given; rather, we refer to families of random variables with some free parameters whose distribution law is discovered from a sample through twisting arguments. Nevertheless, both for conciseness and for the sake of linking our results with conventional ones, we will often keep referring, by abuse of notation, to a given random variable X, yet with random parameters. Within this notation we match our notion of sample, as a prefix of a string of data, with the usual definition of it as a specification of a set of identically distributed random variables.
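Before moving to the example, the mechanics of the twisting argument (5) are easy to check numerically: raising the threshold never decreases the count of ones, and a strictly larger count certifies that the threshold was raised. The sketch below is ours (NumPy assumed) and verifies both implications on one fixed realization of the uniform seeds.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
u = rng.random(m)            # fixed realizations of the uniform seeds U_1..U_m
p = 0.4                      # threshold generating the observed sample, as in (4)
k = int(np.sum(u <= p))      # observed number of 1's

def k_tilde(p_tilde):
    """Number of 1's obtained by switching the threshold to p_tilde (same U's)."""
    return int(np.sum(u <= p_tilde))

for p_tilde in np.linspace(0.0, 1.0, 101):
    kt = k_tilde(p_tilde)
    # Left implication of (5): p < p_tilde  implies  k_tilde >= k
    if p < p_tilde:
        assert kt >= k
    # Right implication of (5): k_tilde >= k + 1  implies  p < p_tilde
    if kt >= k + 1:
        assert p < p_tilde
print("implication chain (5) verified on this realization, k =", k)
```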
Example [Inferring a probability]. Let X denote a random variable distributed according to a Bernoulli law of mean P, (X_1, ..., X_m) a sample of size m from X and k = Σ_i x_i the number of 1's in a specification of the sample. A symmetric confidence interval of level δ for P is (l_i, l_s), where l_i is the δ/2 quantile of the Beta distribution of parameters k and m−k+1 [6], and l_s is the analogous 1−δ/2 quantile for parameters k+1 and m−k. Indeed, consider the explanation of X given by (4) and the twisting argument (5). In this case K_p̃ follows a Binomial distribution law of parameters m and p̃, so that (6) reads

$\sum_{i=k}^{m} \binom{m}{i} \tilde p^{\,i} (1-\tilde p)^{m-i} \ \ge\ F_P(\tilde p) \ \ge\ \sum_{i=k+1}^{m} \binom{m}{i} \tilde p^{\,i} (1-\tilde p)^{m-i}$    (7)

Having introduced the incomplete Beta function I_β as the c.d.f. of a random variable Be(h, r) following a Beta distribution of parameters h and r, that is

$I_\beta(h,r) \equiv P(\mathrm{Be}(h,r) \le \beta) = 1 - \sum_{i=0}^{h-1} \binom{h+r-1}{i} \beta^{\,i} (1-\beta)^{h+r-1-i}$    (8)

the above bounds can be written as

$I_{\tilde p}(k, m-k+1) \ \ge\ F_P(\tilde p) \ \ge\ I_{\tilde p}(k+1, m-k)$    (9)
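As a numerical aside (our own sketch, not part of the paper; SciPy assumed), the interval extremes stated at the beginning of this example are just Beta quantiles, directly available in scipy.stats; the script also checks the binomial-tail identity (8) underlying (7) and (9).

```python
import numpy as np
from scipy.stats import beta, binom

def bernoulli_ci(k, m, delta=0.1):
    """Symmetric level-delta confidence interval for P as stated above:
    l_i = delta/2 quantile of Beta(k, m-k+1), l_s = (1-delta/2) quantile of Beta(k+1, m-k)."""
    l_i = beta.ppf(delta / 2, k, m - k + 1) if k > 0 else 0.0   # guard at the degenerate end k = 0
    l_s = beta.ppf(1 - delta / 2, k + 1, m - k) if k < m else 1.0  # guard at k = m
    return l_i, l_s

m, k, delta = 20, 7, 0.1
print(bernoulli_ci(k, m, delta))

# Sanity check of (8): I_beta(h, r) equals the upper binomial tail P(Bin(h+r-1, beta) >= h).
h, r, b = k, m - k + 1, 0.37
lhs = beta.cdf(b, h, r)
rhs = 1 - binom.cdf(h - 1, h + r - 1, b)   # = sum_{i=h}^{h+r-1} C(h+r-1, i) b^i (1-b)^(h+r-1-i)
assert np.isclose(lhs, rhs)
```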

Therefore, since

$\delta_I = I_{l_s}(k+1, m-k) - I_{l_i}(k, m-k+1)$    (10)

is a lower bound to F_P(l_s) − F_P(l_i) = P(l_i < P < l_s), the desired confidence interval can be found by dividing the probability measure outside (l_i, l_s) into two equal parts, in order to obtain a two-sided interval symmetric in the tail probabilities. We thus obtain the extremes of the interval as the solutions l_i and l_s of the system of equations

$I_{l_s}(k+1, m-k) = 1 - \delta/2$    (11)
$I_{l_i}(k, m-k+1) = \delta/2$    (12)

To check the effectiveness of this computation we considered a string of unitary uniform variables representing, respectively, the randomness source of a sample and a population of Bernoulli variables. Then, according to the explaining function (4), we computed a sequence of 220-bit-long Bernoulli vectors as p rises from 0 to 1. The trajectory described by the point of coordinates k/20 and h/200, i.e. the frequency of ones in the sample and in the population respectively, is reported along one fret line in Fig. 2. We repeated this experiment 20 times (each time using different vectors of uniform variables). Then we drew on the same graph the solutions of equations (11-12) with respect to l_i and l_s, for varying k and δ = 0.1. As we can see, for a given value of k the intercepts of the above curves with a vertical line of abscissa k/20 determine an interval containing almost all intercepts of the frets with the same line. A more intensive experiment would show that, in the approximation of h/200 with the asymptotic frequency of ones in the suffixes of the first 20 sampled values, on all samples (and even for each sample, if we draw many suffixes of the same one) almost 100(1−δ) percent of the frets fall within the analytically computed curves.

Fig. 2. Generating 0.9 confidence intervals for the mean P of a Bernoulli random variable with population and sample of n=200 and m=20 elements, respectively. φ = k/m = frequency of ones in the sample; ψ = h/n = frequency of ones in the population. Fret lines: trajectories described by the number of 1's in sample and population when p ranges from 0 to 1, for different sets of initial uniform random variables. Curves: trajectories described by the interval extremes when the observed number k of 1's in the sample ranges from 0 to m.
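The fret-line experiment of Fig. 2 is straightforward to reproduce. The following sketch is our reconstruction (NumPy and SciPy as assumed dependencies): for each set of 220 uniform seeds it sweeps p over a grid, records the sample and population frequencies, and reports how often the population frequency falls between the interval extremes solved from (11-12).

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
m, n, delta = 20, 200, 0.1

def interval(k):
    # Solutions of (11)-(12): Beta quantiles, with guards at the degenerate ends.
    l_i = beta.ppf(delta / 2, k, m - k + 1) if k > 0 else 0.0
    l_s = beta.ppf(1 - delta / 2, k + 1, m - k) if k < m else 1.0
    return l_i, l_s

inside = total = 0
for _ in range(20):                       # 20 fret lines
    u = rng.random(m + n)                 # sample seeds followed by population seeds
    for p in np.linspace(0.0, 1.0, 221):  # threshold rising from 0 to 1
        k = int(np.sum(u[:m] <= p))       # ones in the sample
        h = int(np.sum(u[m:] <= p))       # ones in the population
        l_i, l_s = interval(k)
        inside += (l_i <= h / n <= l_s)
        total += 1
print("fraction of fret points inside the 0.9 curves:", inside / total)
```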
2 Algorithmic inference of Boolean functions

In the PAC learning framework the parameter under investigation is the probability that the inferred function will compute erroneously on the next inputs (i.e. will not explain new sampled points). In greater detail, the general form of the sample is

$Z_m = \{(X_i, b_i),\ i = 1, \dots, m\}$    (13)

where the b_i are Boolean variables. If we assume that for every M and every Z_M an f exists in a Boolean class C, call it c, such that Z_M = {(X_i, c(X_i)), i = 1, ..., M}, then we are interested in the measure of the symmetric difference c △ h between any such c and another function computed from Z_m, which we denote as the hypothesis h (see Fig. 3).

Fig. 3. Functionally linked variables: circle h describing the sample and possible circles describing the population. Small diamonds and circles: sampled points; line-filled region: symmetric difference.

The peculiarity of this inference problem is that some degrees of freedom of our sample are burned by the functional links on the labels. Namely, let us denote by U_{c△h} the measure of the above symmetric difference: for a given z_m this is the random variable corresponding to the parameter R(α) for a suitable mapping from {h} to Δ. Then the twisting argument reads (with some caveats):

$(T_\varepsilon \ge t_{U_{c\triangle h}} + 1) \Leftarrow (U_{c\triangle h} < \varepsilon) \Leftarrow (T_\varepsilon \ge t_{U_{c\triangle h}} + \mu)$    (14)

where t_{U_{c△h}} is the number of actual sample points falling in c △ h (the empirical risk in Vapnik's notation [8]), T_ε is the analogous statistic for an enlargement of c △ h of measure ε, and µ is a new complexity measure directly referred to C. The threshold in the left inequality is due to the fact that h is in its own turn a function A of a sample specification z_m, so that if A is such that the symmetric difference grows with the set of included sample points, and vice versa, then (U_{c△h} < ε) implies that any enlargement region containing c △ h must violate the label of at least one more of the sampled points at the basis of the computation of h. The quantity µ is an upper bound on the number of sample points sufficient to witness an eventual increase of U_{c△h} after a new hypothesis containing c △ h has been generated. Its extension to the whole class of concepts C and class of hypotheses H is called the detail D_{C,H}. Although semantically different from the VC dimension [5], when H coincides with C this complexity index is related to the latter by the following theorem.

Theorem [1]: Denoting by d_VC(C) the VC dimension of a concept class C with detail D_{C,C},

$(d_{\mathrm{VC}}(C) - 1)/176 < D_{C,C} < d_{\mathrm{VC}}(C) + 1$    (15)

Theorem [2]: Assume we are given a concept class C on a space, a sample z_m drawn from Z_m as in (13), and a learning function A: {z_m} → C (satisfying the usual regularity conditions, which represent the counterpart of the "well behaved function" request of [5]; for a formal definition see [1]). Consider the family of sets {c △ h} with c ∈ C labeling z_m, h = A(z_m) and detail D_{C,C} = µ, misclassifying at least t′ and at most t points, of probability π ∈ (0, 1), and denote by U_{c△h} the random variable given by the probability measure of c △ h and by F_{U_{c△h}} its c.d.f. Then for each z_m and β ∈ (π, 1)

$I_\beta(1+t', m-t') \ \ge\ F_{U_{c\triangle h}}(\beta) \ \ge\ I_\beta(\mu+t, m-(\mu+t)+1)$    (16)

where

$I_\beta(\mu+t, m-(\mu+t)+1) = 1 - \sum_{i=0}^{\mu+t-1} \binom{m}{i} \beta^{\,i} (1-\beta)^{m-i}$    (17)

is the incomplete Beta function.

Corollary: Under the same hypotheses of the above theorem:
1. the ratio between the maximum and minimum numbers of examples needed to learn C with accuracy parameters 0 < ε < 1/8, 0 < δ < 1/100 is bounded by a constant [1];
2. a pair of extremes (l_i, l_s) of a confidence interval of level δ for U_{c△h} is constituted, respectively, by the δ/2 quantile of the Beta distribution of parameters 1+t and m−t, and by the analogous 1−δ/2 quantile for parameters µ+t and m−(µ+t)+1 [4] (see the sketch after this list);
3. class complexity µ and hypothesis accuracy t add linearly [3].
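Item 2 of the corollary is directly computable. The sketch below is ours (SciPy assumed; the inputs m, t, µ, δ are treated as given) and simply returns the two Beta quantiles named there.

```python
from scipy.stats import beta

def error_risk_ci(m, t, mu, delta=0.1):
    """Confidence interval for the error measure U_{c delta h} as in item 2 of the corollary:
    lower extreme = delta/2 quantile of Beta(1 + t, m - t),
    upper extreme = (1 - delta/2) quantile of Beta(mu + t, m - (mu + t) + 1)."""
    l_i = beta.ppf(delta / 2, 1 + t, m - t)
    l_s = beta.ppf(1 - delta / 2, mu + t, m - (mu + t) + 1)
    return l_i, l_s

# Example: 30 points, a consistent hypothesis (t = 0), detail mu = 4 vs. mu = 1.
print(error_risk_ci(30, 0, 4))
print(error_risk_ci(30, 0, 1))
```

Raising µ moves only the upper extreme, so the class detail widens the interval from above while the lower extreme depends on the misclassification count alone.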
Example [Learning rectangles]: Consider h belonging to the class of rectangles. We move from the one-dimensional case of Fig. 2 to the two-dimensional case depicted in Fig. 4. Here again we give label 1 to a single coordinate u_j, j = 1, 2 (each ruled by a corresponding uniform random variable U_j) if it falls below a given threshold p_j, and label 0 otherwise. Moreover, we give to the point a_i of coordinates (u_1, u_2) a label equal to the product of the labels of its single coordinates. Thus the probability p_c that a point a falls in the open rectangle c bounded by the coordinate axes and the two mentioned threshold lines (for short we will henceforth refer to these rectangles as bounds' rectangles) is p_1 p_2. Let us complicate the inference scheme in two ways:
1. We move from U_j to the family of uniform random variables Z_j in [0, θ], explained by the function z = θu with θ ∈ (0, +∞).
2. We maintain the same labeling rule but do not know c, i.e. the thresholds p_1 and p_2.

Rather, within the class of bounds' rectangles containing all 1-labeled sample points yet excluding all 0-labeled sample points (the consistent bounds' rectangles, as statistics), we will identify it with the maximal one h, i.e. the one having the largest consistent edges (ending just before the closest 0-labeled points). Letting p′_1 and p′_2 be the lengths of these edges, we now look for the probability p_h = p′_1 p′_2 / θ², representing the asymptotic frequency with which future points (generated with the above sampling mechanism for any θ) will fall in h.

We may imagine a whole family of sequences of domains B, each sequence pivoted on a possible rectangle. In some sequences the domain B_p̃ of measure p̃ will include the pivot, in other ones it will be included by it. Thus we need witnesses that our actual h, computed from the actual sample, constitutes a pivot included in B_p̃. But this happens if two special points, namely the negative point preventing the rectangle from expanding sideways and the negative point preventing it from expanding upward, are included in B_p̃. Thus let us enrich the family of sequences, taking for each rectangle and each possible pair of witness points the pivot constituted by the union of the rectangle with these points. With respect to the sequence pivoted on our actual rectangle and witnesses, we have that if 2 or more negative points are included in B_p̃, then for sure the witness points are among them. Hence a twisting argument reads:

$(p_h < \tilde p) \Leftarrow (k_{\tilde p} \ge k + 2)$    (18)

where k_p̃ is still a specification of a Binomial random variable of parameters m (the sample size) and p̃, accounting for the sample points contained in B_p̃. From the left, since θ is free, (p_h < p̃) requires that there be an enlargement of h whose measure, for a proper θ, exactly equals p̃. But since both edges of h are bounded by a negative point, this enlargement must contain at least one point more than h itself. Formally

$(k_{\tilde p} \ge k + 1) \Leftarrow (p_h < \tilde p)$    (19)

Putting together the two pieces of the twisting argument, we obtain the corresponding bounds on probabilities as follows:

$P(K_{\tilde p} \ge k+1) \ \ge\ P(P_h < \tilde p) = F_{P_h}(\tilde p) \ \ge\ P(K_{\tilde p} \ge k+2)$    (20)

Generalizing our arguments to an n-dimensional space, we recognize that the number of points witnessing the expansions of the maximal hypotheses on bounds' rectangles is at most n. Hence n constitutes the detail µ of this class of hypotheses.

We extended the experiment shown in Fig. 2 as follows. We built a sample of m = 30 elements by drawing random Y_i coordinates. To stress the spread of the confidence intervals we abandoned the uniform distribution; namely, we used the sampling mechanism (U, g(u)) with g(u) = u^j for computing the j-th coordinate (so that the first coordinate is u, the second u², etc.). Moreover, to mark the drift from the Bernoulli variable, we refer to a four-dimensional rectangle in Ψ = [0,1]^4, with respect to which the rectangle in Fig. 4(a) could represent just a projection. Then, to figure out a wide set of possible labeling mechanisms, we stored the sample point labelings according to all rectangles with vertices in a suitable discretization (steps of 1/10 on each edge) of the unit hypercube. Then we forgot the source figures and for each labeling computed the maximal consistent bounds' rectangle h, and we drew the graph in Fig. 4(b). Namely, we drew 10 samples as before, and for each sample labeling we reported on the graph the actual frequency φ and the probability p_h (analytically computed on the basis of the rectangle coordinates) of drawing a point in Ψ belonging to the guessed maximal hypothesis. On the same graph we also reported the curves describing the course of the symmetric 0.9 confidence interval for P_h with the observed frequency of falling inside the rectangle h, according to (16). Finally, for comparison's sake we also drew the curves obtained for a pure Bernoulli variable.

Fig. 4. Generating 0.9 confidence intervals for the probability P_h of a bounds' rectangle in Ψ = [0,1]^4 from a sample of 30 elements. (a) The drawn sample and one of its possible labelings in a two-dimensional projection. Bullets: 1-labeled (positive) sampled points; diamonds: 0-labeled (negative) sampled points. (b) Points: values of the frequency φ and probability p_h of falling inside a bounds' rectangle for a close lattice of labeling functions. Dashed curves: trajectories described by the confidence interval extremes with reference to µ = 1. Plain curves: trajectories described by the confidence interval extremes with reference to µ = 4.
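Curves like the µ = 1 and µ = 4 ones of Fig. 4(b) can be sketched by instantiating the extremes of item 2 of the corollary for every possible count of points observed inside the rectangle. The snippet below is our reading of that construction, not the authors' original plotting code (SciPy assumed; here t is taken, as a hedged reading of the figure, as the number of sample points falling inside h).

```python
from scipy.stats import beta

m, delta = 30, 0.1

def ci_extremes(t, mu):
    """Interval extremes from item 2 of the corollary for an observed count t
    and class detail mu (our hedged reading of how the Fig. 4(b) curves arise)."""
    l_i = beta.ppf(delta / 2, 1 + t, m - t)
    l_s = beta.ppf(1 - delta / 2, mu + t, m - (mu + t) + 1)
    return l_i, l_s

# Sweep the observed count t (frequency phi = t/m) for mu = 1, the Bernoulli-like case,
# and mu = 4, the detail of four-dimensional bounds' rectangles; t stays in the valid range.
for mu in (1, 4):
    print(f"mu = {mu}")
    for t in range(0, m - mu + 1, 5):
        l_i, l_s = ci_extremes(t, mu)
        print(f"  phi = {t / m:.2f}  interval = ({l_i:.3f}, {l_s:.3f})")
```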

In spite of some apparent unbalancing in the figure, the percentages of points falling outside the upper and lower bound curves are approximately equal, 3.32% and 3.67% respectively, thus satisfying the 5% upper bounds used for drawing these curves. The analogous percentages for the curves drawn for the Bernoulli distribution, 16.2% and 0.28%, denote their inadequacy. The smaller values of actual versus allowed bound trespassers can be attributed to the worst-case duty of our curves: they must guarantee a given confidence whatever the underlying distribution law is.

The above check on domain measures is the key action of any PAC learning task, where confidence intervals like the ones in Fig. 4(b) are the ultimate probabilistic learning target. Indeed, the distinguishing features of the above case study are the following:
- We are building a domain h on the basis of the sampled coordinates (besides the a priori specifications that the rectangle's lower-left vertex coincides with the axes' origin and that the edge orientations are parallel to them).
- Though coming from independent U_i's, some sample points, those labeled by 1, share the fact of being all inside h, and those labeled by 0 vice versa.
- The above twisting argument holds whatever the joint distribution law of the coordinates is.

The experiment in the figure confirms that the sole probabilistic consequence of these additional features, in comparison to the original case study of Fig. 2, is in the bounds of the confidence region, now pushed up by the fact that four points in place of one need to witness the inclusion of h in a proper domain of measure p̃, and at least one more point is additionally included in an enlargement of h of this measure.

3 Conclusions

PAC learning represents a very innovative perspective in inferential statistics, which relates the randomness of the sample data to the mutual structure deriving from their syntactical properties. We set up an inferential mechanism and a structural complexity index to stress this idea. Assuming a source of uniformly random data, we want to discover the function g_θ mapping these data into random variables of a given and possibly unknown distribution law. In the case of Bernoulli variables whose values are related to ancillary data, a prominent part of our inference may lie in fixing this relation, which is exactly an instance of the problem of learning Boolean functions. In the paper we show the benefit of this contrivance in terms of its capability of studying the distribution law of the error risk, in connection with specific features of the computed hypothesis such as the detail of its class. From an operational viewpoint this benefit translates into a more favourable relation between the sample size and the accuracy parameters ε and δ. In particular, we obtain a definite narrowing of the confidence intervals of the error risk with respect to those usually computed in the Vapnik approach. In a more philosophical perspective, our approach provides a clear statistical rationale for learning theory, laying the premises for new developments of this theory.

References:
[1] B. Apolloni and S. Chiaravalli, PAC learning of concept classes through the boundaries of their items, Theoretical Computer Science 172, 1997.
[2] B. Apolloni, E. Esposito, D. Malchiodi, C. Orovas, G. Palmas and J. G. Taylor, A General Framework for Learning Rules from Data, IEEE Transactions on Neural Networks, 2004, to appear.
[3] B. Apolloni and D. Malchiodi, Gaining degrees of freedom in subsymbolic learning, Theoretical Computer Science 255, 2001.
[4] B. Apolloni, D. Malchiodi and S. Gaito, Algorithmic Inference in Machine Learning, International Series on Advanced Intelligence Vol. 5, Advanced Knowledge International, Magill, Adelaide.
[5] A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the Vapnik-Chervonenkis Dimension, Journal of the ACM 36, 1989.
[6] J. W. Tukey, Non-parametric estimation II. Statistically equivalent blocks and tolerance regions: the continuous case, Annals of Mathematical Statistics 18, 1947.
[7] L. G. Valiant, A theory of the learnable, Communications of the ACM 27(11), 1984.
[8] V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998.
