Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701. Stochastic Convergence. Barnabás Póczos

Motivation

What have we seen so far? Several algorithms that seem to work fine on training datasets: linear regression, the Naïve Bayes classifier, the perceptron classifier, and Support Vector Machines for regression and classification. How good are these algorithms on unknown test sets? How many training samples do we need to achieve small error? What is the smallest possible error we can achieve? ⇒ Learning theory. To answer these questions, we will need a few powerful tools.

Basic Estimation Theory

Tossing a Die: Estimation of the Parameters θ_1, θ_2, ..., θ_6. [Figure: histograms of the MLE for sample sizes n = 12, 24, 60, 120.] Does the MLE converge to the right value? How fast does it converge?

Tossing a Die: Calculating the Empirical Average. Does the empirical average converge to the true mean? How fast does it converge?

Tossing a Die: Calculating the Empirical Average. [Figure: five sample traces of the running empirical average.] How fast do they converge?
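To make the picture concrete, here is a minimal simulation sketch (not from the slides) that draws a few sample traces of the running empirical average of die rolls; the fair-die setup and the choice of five traces are assumptions made to mirror the figure.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n_rolls, n_traces = 1000, 5
    true_mean = 3.5  # mean of a fair six-sided die

    for _ in range(n_traces):
        rolls = rng.integers(1, 7, size=n_rolls)                    # i.i.d. rolls of a fair die
        running_avg = np.cumsum(rolls) / np.arange(1, n_rolls + 1)  # empirical average after each roll
        plt.plot(running_avg)

    plt.axhline(true_mean, linestyle="--", color="black", label="true mean 3.5")
    plt.xlabel("number of rolls n")
    plt.ylabel("empirical average")
    plt.legend()
    plt.show()

Every trace wanders early on and then settles near 3.5, which is exactly the behavior the law of large numbers below makes precise.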

Key Questions. Do empirical averages converge? Does the MLE converge in the die-tossing problem? What do we mean by convergence? What is the rate of convergence? I want to know the coin parameter θ ∈ [0, 1] within ε = 0.1 error, with probability at least 1 − δ = 0.95. How many flips do I need? Applications: drug testing (does this drug modify the average blood pressure?), user interface design (we will see this later).

Outline. Theory: Stochastic convergence: weak convergence, convergence in probability, strong (almost sure) convergence. Limit theorems: law of large numbers, central limit theorem. Tail bounds: the Markov, Chebyshev, Chernoff, Hoeffding, Bernstein, and McDiarmid inequalities. Application: A/B testing for page layout.

Stochastic convergence: definitions and properties

Convergence of vectors

Convergence in Distribution = Weak Convergence = Convergence in Law. Let {Z, Z_1, Z_2, ...} be a sequence of random variables. Notation: Z_n →d Z (also written Z_n ⇒ Z). Definition: lim_{n→∞} F_{Z_n}(z) = F_Z(z) at every point z where the limit distribution function F_Z is continuous. This is the weakest convergence.

Convergence in Distribution = Weak Convergence = Convergence in Law. Only the distribution functions converge! (NOT the values of the random variables.) [Figure: distribution functions converging to a limit CDF; axis labels 0, 1, and a.]

Convergence in Distribution = Weak Convergence = Convergence in Law. Continuity is important! Example: let Z_n be uniform on [0, 1/n]. [Figure: the CDF of Z_n rises from 0 to 1 over [0, 1/n]; the limit CDF jumps from 0 to 1 at 0.] The limit random variable is the constant 0: F_{Z_n}(z) → F_Z(z) at every continuity point of F_Z, even though the convergence fails at the discontinuity point z = 0. In this example the limit Z is discrete, not random (constant 0), although Z_n is a continuous random variable.

Properties. Convergence in Distribution = Weak Convergence = Convergence in Law. Z_n and Z can still be independent even if their distributions are the same! Scheffé's theorem: convergence of the probability density functions ⇒ convergence in distribution. Example: (the Central Limit Theorem).

Convergence in Probability. Notation: Z_n →P Z. Definition: for every ε > 0, lim_{n→∞} P(|Z_n − Z| > ε) = 0. This indeed measures how far the values Z_n(ω) and Z(ω) are from each other.

Almost Sure Convergence. Notation: Z_n →a.s. Z. Definition: P({ω : lim_{n→∞} Z_n(ω) = Z(ω)}) = 1.

Convergence in p-th Mean (L^p norm). Notation: Z_n →L^p Z. Definition: lim_{n→∞} E[|Z_n − Z|^p] = 0. Properties: convergence in L^p implies convergence in L^q for all 1 ≤ q ≤ p, and convergence in L^p implies convergence in probability.

Counter Examples

Further Readings on Stochastic Convergence: http://en.wikipedia.org/wiki/convergence_of_random_variables ; Patrick Billingsley: Probability and Measure; Patrick Billingsley: Convergence of Probability Measures.

Finite sample tail bounds: useful tools!

Markov's inequality. If X is any nonnegative random variable and a > 0, then P(X ≥ a) ≤ E[X]/a. Proof: decompose the expectation, E[X] = E[X · 1{X ≥ a}] + E[X · 1{X < a}] ≥ a · P(X ≥ a). Corollary: Chebyshev's inequality.

Chebyshev's inequality. If X is any random variable with finite variance and a > 0, then P(|X − E[X]| ≥ a) ≤ Var(X)/a². Here Var(X) is the variance of X, defined as Var(X) = E[(X − E[X])²]. Proof: apply Markov's inequality to the nonnegative random variable (X − E[X])² with threshold a².
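As a quick numerical sanity check (not part of the slides), the sketch below compares the Markov and Chebyshev bounds with the true tail probability for an Exponential(1) variable; the distribution and the threshold a = 3 are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=1_000_000)   # X ~ Exp(1): E[X] = 1, Var(X) = 1
    a = 3.0

    empirical = np.mean(x >= a)                # Monte Carlo estimate of P(X >= a), about exp(-3) ~ 0.05
    markov = x.mean() / a                      # Markov:    P(X >= a) <= E[X]/a, about 0.33
    chebyshev = x.var() / (a - x.mean()) ** 2  # Chebyshev: P(X >= a) <= Var(X)/(a - mu)^2, about 0.25

    print(empirical, markov, chebyshev)        # both bounds hold, but neither is tight

Both inequalities are valid for essentially arbitrary distributions, which is why they are loose; the exponential tail bounds later in the deck are much sharper.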

Generalizations of Chebyshev's inequality. Chebyshev's inequality is equivalent to: P(|X − μ| ≥ kσ) ≤ 1/k² for every k > 0, where σ² = Var(X). There are two-sided refinements for the symmetric case (X has a symmetric distribution) and for the asymmetric case (X has an asymmetric distribution), and there are lots of other generalizations, for example for multivariate X.

Higher moments? Markov: P(|X| ≥ a) ≤ E[|X|]/a. Chebyshev: P(|X − μ| ≥ a) ≤ E[(X − μ)²]/a². Higher moments: P(|X − μ| ≥ a) ≤ E[|X − μ|^n]/a^n, where n ≥ 1. Other functions instead of polynomials? The exponential function: for any t > 0, P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ E[e^{tX}]/e^{ta}. Proof: Markov's inequality applied to the nonnegative variable e^{tX} (the Chernoff bounding technique).

Law of Large Numbers

Do empirical averages converge? Chebyshev's inequality is good enough to study the question: do the empirical averages converge to the true mean? Answer: Yes, they do. (Law of large numbers.)

Law of Large Numbers. Let X_1, X_2, ... be i.i.d. random variables with mean μ, and let X̄_n = (X_1 + ... + X_n)/n be the empirical average. Weak Law of Large Numbers: X̄_n → μ in probability. Strong Law of Large Numbers: X̄_n → μ almost surely.

Weak Law of Large Numbers, Proof I: Assume finite variance σ². (This assumption is not essential.) Then E[X̄_n] = μ and Var(X̄_n) = σ²/n. Therefore, by Chebyshev's inequality, P(|X̄_n − μ| < ε) ≥ 1 − σ²/(nε²). As n approaches infinity, this expression approaches 1.

Fourier Transform and Characteristic Function

Fourier Transform. Fourier transform (a unitary transform): F[f](ν) = ∫ f(x) e^{−2πiνx} dx. Inverse Fourier transform: f(x) = ∫ F[f](ν) e^{2πiνx} dν. Other conventions: where to put the 2π? Not preferred: F[f](ω) = ∫ f(x) e^{−iωx} dx with inverse (1/2π) ∫ F[f](ω) e^{iωx} dω is not a unitary transform and doesn't preserve the inner product. The symmetric convention with a 1/√(2π) factor in both directions is again a unitary transform.

Fourier Transform: Fourier transform and inverse Fourier transform as above. Properties: the inverse really is an inverse, F^{-1}[F[f]] = f, and there are lots of other important ones (linearity, the convolution theorem, Parseval/Plancherel). The Fourier transform will be used to define the characteristic function and to represent distributions in an alternative way.

Characteristic function. How can we describe a random variable? The cumulative distribution function (cdf), the probability density function (pdf). The characteristic function provides an alternative way of describing a random variable. Definition: φ_X(t) = E[e^{itX}], essentially the Fourier transform of the density (more generally, of the distribution) of X.

Characteristic function: Properties. It always exists, even when moments do not; for example, the Cauchy distribution doesn't have a mean but still has a characteristic function. It is continuous on the entire space, even if X is not continuous. It is bounded (|φ_X(t)| ≤ 1), even if X is not bounded. Characteristic function of the constant a: φ(t) = e^{ita}. Lévy's continuity theorem: if φ_{Z_n}(t) → φ(t) pointwise and φ is continuous at 0, then φ is the characteristic function of some random variable Z and Z_n → Z in distribution.
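A small numerical sketch of the definition (not from the slides): the empirical characteristic function of standard normal samples should match the known closed form e^{−t²/2}; the sample size and the grid of t values are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)      # samples of X ~ N(0, 1)
    t = np.linspace(-5, 5, 11)

    phi_empirical = np.exp(1j * np.outer(t, x)).mean(axis=1)  # phi_X(t) = E[exp(i t X)], Monte Carlo estimate
    phi_exact = np.exp(-t**2 / 2)                             # characteristic function of N(0, 1)

    print(np.max(np.abs(phi_empirical - phi_exact)))          # small, up to Monte Carlo error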

Weak Law of Large Numbers, Proof II: By Taylor's theorem for complex functions, the characteristic function satisfies φ_X(t) = 1 + itμ + o(t) as t → 0. By the properties of characteristic functions, φ_{X̄_n}(t) = [φ_X(t/n)]^n = [1 + itμ/n + o(1/n)]^n → e^{itμ}. Lévy's continuity theorem ⇒ the limit is the constant distribution concentrated at the mean μ.

Convergence rate for LLN. Markov: doesn't give a rate. Chebyshev: with probability at least 1 − δ, |X̄_n − μ| ≤ σ/√(nδ). Can we get a smaller, logarithmic dependence on δ?

Further Readings on LLN, Characteristic Functions, etc.: http://en.wikipedia.org/wiki/levy_continuity_theorem ; http://en.wikipedia.org/wiki/law_of_large_numbers ; http://en.wikipedia.org/wiki/characteristic_function_(probability_theory) ; http://en.wikipedia.org/wiki/fourier_transform

More tail bounds: more useful tools!

Hoeffding's inequality (1963). Let X_1, ..., X_n be independent random variables with X_i ∈ [a_i, b_i] almost surely, and let X̄_n be their average. Then P(|X̄_n − E[X̄_n]| ≥ ε) ≤ 2 exp(−2n²ε² / Σ_i (b_i − a_i)²). It only involves the range of the variables, not the variances.

Convergence rate for LLN from Hoeffding. For variables bounded in an interval of length c, Hoeffding gives: with probability at least 1 − δ, |X̄_n − E[X̄_n]| ≤ c √(ln(2/δ) / (2n)). The dependence on δ is only logarithmic.
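As a worked check of this logarithmic dependence (not on the slides), and an answer to the coin question from the Key Questions slide with ε = 0.1, δ = 0.05, and range c = 1, solving the Hoeffding tail bound for n gives:

    2 exp(−2nε²) ≤ δ  ⇔  n ≥ ln(2/δ) / (2ε²) = ln(40) / 0.02 ≈ 185 flips,

compared with roughly 500 flips from the Chebyshev bound with the worst-case variance 1/4.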

Proof of Hoeffding's Inequality: a few minutes of calculation, combining the exponential (Chernoff) bounding trick from the earlier slide with Hoeffding's lemma for bounded random variables.

Bernstein's inequality (1946). Let X_1, ..., X_n be independent zero-mean random variables with |X_i| ≤ M almost surely. Then P(Σ_i X_i > t) ≤ exp(− (t²/2) / (Σ_i E[X_i²] + Mt/3)). It contains the variances, too, and can give tighter bounds than Hoeffding.
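To see the advantage when the variance is small, here is a small comparison sketch (not from the slides); the Bernoulli(0.01) setting and the deviation ε are illustrative assumptions, and the Bernstein bound is used in its two-sided form for the average with |X_i − E[X_i]| ≤ 1.

    import math

    # Average of n Bernoulli(p) variables with small p: the variance p(1-p) is tiny, but the range is still 1.
    n, p, eps = 10_000, 0.01, 0.005
    var = p * (1 - p)

    hoeffding = 2 * math.exp(-2 * n * eps**2)                      # uses only the range [0, 1]
    bernstein = 2 * math.exp(-(n * eps**2 / 2) / (var + eps / 3))  # uses the variance; M = 1

    print(hoeffding, bernstein)   # ~1.2 (vacuous) vs ~4e-5: Bernstein is far tighter here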

Bennett's inequality (1962). With the same assumptions and σ² = Σ_i E[X_i²]: P(Σ_i X_i > t) ≤ exp(− (σ²/M²) h(Mt/σ²)), where h(u) = (1 + u) ln(1 + u) − u. Proof idea: Bennett's inequality ⇒ Bernstein's inequality, using h(u) ≥ u² / (2 + 2u/3).

McDiarmid's Bounded Difference Inequality. Let X_1, ..., X_n be independent and let f satisfy the bounded difference condition: changing the i-th argument changes f by at most c_i, i.e. |f(x_1, ..., x_i, ..., x_n) − f(x_1, ..., x_i', ..., x_n)| ≤ c_i for all i and all arguments. Then P(f(X_1, ..., X_n) − E[f(X_1, ..., X_n)] ≥ ε) ≤ exp(−2ε² / Σ_i c_i²). It follows that Hoeffding's inequality is the special case f(x_1, ..., x_n) = (x_1 + ... + x_n)/n with c_i = (b_i − a_i)/n.

Further Readings on Tail Bounds: http://en.wikipedia.org/wiki/hoeffding's_inequality ; http://en.wikipedia.org/wiki/doob_martingale (McDiarmid) ; http://en.wikipedia.org/wiki/bennett%27s_inequality ; http://en.wikipedia.org/wiki/markov%27s_inequality ; http://en.wikipedia.org/wiki/chebyshev%27s_inequality ; http://en.wikipedia.org/wiki/bernstein_inequalities_(probability_theory)

Limit Distribution?

Central Limit Theorem. Lindeberg–Lévy CLT: if X_1, X_2, ... are i.i.d. with mean μ and finite variance σ² > 0, then √n (X̄_n − μ) → N(0, σ²) in distribution. Lyapunov CLT: the X_i need only be independent, not identically distributed, provided they satisfy the Lyapunov moment condition + some other conditions. Generalizations: multidimensional versions, time processes.

Central Limit Theorem in Practice. [Figure: histograms of sums of i.i.d. variables, unscaled and scaled (standardized).]
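A minimal simulation sketch of the scaled case (not from the slides): standardized averages of uniform variables are compared with the standard normal density; the Uniform(0,1) choice and the sample sizes are assumptions.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    n, trials = 100, 50_000
    x = rng.uniform(size=(trials, n))                 # i.i.d. Uniform(0,1): mu = 1/2, sigma^2 = 1/12

    z = (x.mean(axis=1) - 0.5) / np.sqrt(1 / 12 / n)  # standardized empirical averages
    plt.hist(z, bins=60, density=True, alpha=0.6, label="scaled averages")

    grid = np.linspace(-4, 4, 200)
    plt.plot(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi), label="N(0,1) density")
    plt.legend()
    plt.show()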

Proof of CLT (sketch). From the Taylor series around 0, the characteristic function of a standardized summand satisfies φ(t) = 1 − t²/2 + o(t²). By the properties of characteristic functions, the standardized sum has characteristic function [φ(t/√n)]^n = [1 − t²/(2n) + o(1/n)]^n → e^{−t²/2}, the characteristic function of the Gauss distribution. Lévy's continuity theorem + uniqueness ⇒ CLT.

How fast do we converge to the Gauss distribution? The CLT itself doesn't tell us anything about the convergence rate. Berry–Esseen Theorem (independently discovered by A. C. Berry in 1941 and C.-G. Esseen in 1942): if ρ = E[|X_1 − μ|³] < ∞, then sup_x |P(√n (X̄_n − μ)/σ ≤ x) − Φ(x)| ≤ C ρ / (σ³ √n) for a universal constant C.

Did we answer the questions we asked? Do empirical averages converge? What do we mean by convergence? What is the rate of convergence? What is the limiting distribution of standardized averages? Next time we will continue with these questions: How good are the ML algorithms on unknown test sets? How many training samples do we need to achieve small error? What is the smallest possible error we can achieve?

Further Readings on CLT: http://en.wikipedia.org/wiki/central_limit_theorem ; http://en.wikipedia.org/wiki/law_of_the_iterated_logarithm

Tail bounds in practice

A/B testing. Two possible webpage layouts: which layout is better? Experiment: some users see design A, the others see design B. How many trials do we need to decide which page attracts more clicks?

A/B testing. Let us simplify this question a bit. Assume that in group A, p(click | A) = 0.10 and p(no click | A) = 0.90. Assume that in group B, p(click | B) = 0.11 and p(no click | B) = 0.89. Assume also that we know these probabilities in group A, but we don't yet know them in group B. We want to estimate p(click | B) with less than 0.01 error.

Chebyshev's Inequality. In group B the click probability is p = 0.11 (we don't know this yet). We want a failure probability of δ = 5%. If we have no prior knowledge, we can only bound the variance by σ² = 0.25 (the worst case p = 0.5, a uniform click/no-click distribution, has the largest variance, 0.25); Chebyshev then requires n ≥ σ²/(δ ε²) = 0.25/(0.05 · 0.01²) = 50,000 users. If we know that the click probability is < 0.15, then we can bound σ² by 0.15 · 0.85 = 0.1275. This requires at most 25,500 users.

Hoeffding's bound. Hoeffding: P(|p̂ − p| ≥ ε) ≤ 2 exp(−2nε²/c²). The random variable has bounded range [0, 1] (click or no click), hence c = 1. Solve Hoeffding's inequality for n: n ≥ ln(2/δ) / (2ε²) = ln(40) / (2 · 0.01²) ≈ 18,445 users. This is better than Chebyshev.
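A minimal end-to-end sketch (not part of the slides): compute the Hoeffding sample size and verify by simulation that the estimate of p(click | B) = 0.11 lands within 0.01 of the truth far more often than the guaranteed 95% of the time; the repetition count is an arbitrary choice.

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    p_b, eps, delta = 0.11, 0.01, 0.05

    n = math.ceil(math.log(2 / delta) / (2 * eps**2))  # Hoeffding sample size with range c = 1
    clicks = rng.random((200, n)) < p_b                 # 200 simulated repetitions of the experiment
    p_hat = clicks.mean(axis=1)                         # estimate of p(click | B) in each repetition

    print(n, np.mean(np.abs(p_hat - p_b) >= eps))       # n = 18445; observed failure rate is far below 0.05

The observed failure rate is essentially zero because Hoeffding is a worst-case bound; in practice the true variance p(1 − p) is much smaller than the range-based bound assumes.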

Thanks for your attention 58