Statistical Learning Theory
Part I: Mathematical Learning Theory (Lectures 1-8), by Sumio Watanabe. Evaluation: Report.
Part II: Information Statistical Mechanics (Lectures 9-15), by Yoshiyuki Kabashima. Evaluation: Report.
Prerequisite Knowledge (1)
In order to follow this lecture, you need:
(1) Vector spaces, linear transforms, and matrix computation.
(2) Partial differentiation and multiple integration, e.g. ∫∫ f(x,y) dx dy.
(3) Basic probability theory, e.g. a probability function p(1), p(2), p(3), ... on a discrete set S.
Prerequisite Knowledge (2)
Statistical learning theory needs mathematics. If you have not learned even one of the following, it is impossible for you to understand this lecture.

Check.1 (Taylor's theorem). Let f(x) be a C^2-class function of x = (x_1, x_2, ..., x_N) in R^N. Then, using the definitions ∇f(x) = (∂f/∂x_i) and ∇²f(x) = (∂²f/∂x_i ∂x_j), there exists a* in R^N (on the line segment between a and x) such that
f(x) = f(a) + ((x-a), ∇f(a)) + (1/2) ((x-a), ∇²f(a*)(x-a)).

Check.2 (Law of large numbers). Let X_1, X_2, ..., X_n be independently and identically distributed random variables with finite expectation M. Then (X_1 + X_2 + ... + X_n)/n converges to M almost surely as n tends to infinity.

Remark. If you do not know these check points, you should learn them in an undergraduate program before participating in this lecture.
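As a quick illustration of Check.2, here is a minimal Python sketch (the Uniform[0,1] distribution is only an illustrative choice) showing the sample mean approaching the true expectation as n grows:

```python
import numpy as np

# Check.2 in action: sample means of i.i.d. Uniform[0,1] variables
# (true expectation M = 0.5) approach M as n grows.
rng = np.random.default_rng(0)
for n in [10, 1000, 100000]:
    x = rng.uniform(0.0, 1.0, size=n)
    print(n, x.mean())  # the mean drifts toward 0.5 as n increases
```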
Part I Mathematical Learning Theory Sumio Watanabe
Part I - 1. Basic Concepts in Statistical Learning Theory Sumio Watanabe
Part I - 1-1. Probability Distribution and Random Variable
Probability Density Function on R
Definition. Let R be the set of all real values. A function p from R to R is called a probability density function if
(1) for an arbitrary x in R, p(x) ≥ 0;
(2) ∫ p(x) dx = 1, where the integral is taken over R.
Example.1 (Standard normal distribution).
p(x) = (1/(2π)^{1/2}) exp(-x^2/2), x in R.
Formula: for a > 0, ∫ exp(-a x^2) dx = (π/a)^{1/2}.
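The Gaussian integral formula can be checked numerically. The sketch below assumes numpy and scipy are available and verifies both the formula and that the standard normal density integrates to 1:

```python
import numpy as np
from scipy.integrate import quad

# Check the formula: for a > 0, the integral of exp(-a x^2) over R is (pi/a)^(1/2).
a = 2.0
integral, _ = quad(lambda x: np.exp(-a * x * x), -np.inf, np.inf)
print(integral, np.sqrt(np.pi / a))  # both approximately 1.2533

# The standard normal density integrates to 1.
p = lambda x: np.exp(-x * x / 2.0) / np.sqrt(2.0 * np.pi)
total, _ = quad(p, -np.inf, np.inf)
print(total)  # approximately 1.0
```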
Example.2 (Uniform distribution on [a,b]).
p(x) = 1/(b-a) (a ≤ x ≤ b),
p(x) = 0 (otherwise).
Probability Distribution on R
Definition. Let p(x) be a probability density function on R. For a subset A contained in R, P(A) is defined by
P(A) = ∫_A p(x) dx.
Then P is called a probability distribution on R.
Example.3. The probability density function
p(x) = 2x (0 ≤ x ≤ 1), p(x) = 0 (otherwise)
gives, for example,
P([0.5, 0.7]) = ∫_{0.5}^{0.7} 2x dx = 0.7^2 - 0.5^2 = 0.24.
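The same probability can be computed numerically, as a quick sanity check (assuming scipy):

```python
from scipy.integrate import quad

# Example.3: p(x) = 2x on [0,1], so P([0.5, 0.7]) = 0.7^2 - 0.5^2 = 0.24.
prob, _ = quad(lambda x: 2.0 * x, 0.5, 0.7)
print(prob)  # approximately 0.24
```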
Remark. Probability and the Axiom of Choice
This page explains a mathematically advanced point. A student who studies introductory probability theory may skip this page. From the mathematical point of view, the axiom of choice is inconsistent with the axiom that every subset of R is measurable; hence mathematical probability theory that employs the axiom of choice needs to determine which subsets are measurable. The family of such subsets is called a completely additive class (a σ-algebra). A student who wants to understand these points should learn measure theory and mathematical probability theory.
Probability Density Function on R^N
Definition. Let N be a positive integer and R^N be the N-dimensional real Euclidean space. A function p from R^N to R is called a probability density function if
(1) for an arbitrary x in R^N, p(x) ≥ 0;
(2) ∫ p(x) dx = 1.
A probability distribution P(A) is defined for a subset A in R^N by
P(A) = ∫_A p(x) dx.
Random Variable
Definition. Let P be a probability distribution on R^N. If a variable X in R^N satisfies
P({X in A}) = ∫_A p(x) dx,
then X is called an R^N-valued random variable, and P and p are called the probability distribution and the probability density function of X, respectively. We also say that X has P and X has p.
Note. In probability theory, a random variable is defined as a measurable function on a probability space.
Expectation Value and Variance
Definition. Assume that X is an R^N-valued random variable which has a probability density function p. Then the expectation value E[X] and the covariance matrix V[X] are respectively defined by
E[X] = ∫ x p(x) dx,
V[X] = ∫ (x - E[X])(x - E[X])^T p(x) dx = E[(X - E[X])(X - E[X])^T] = E[XX^T] - E[X]E[X]^T,
where T denotes the transposed vector. If N = 1, then V[X] is called the variance.
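The identity V[X] = E[XX^T] - E[X]E[X]^T can be illustrated by Monte Carlo sampling. In the sketch below, the 2-dimensional correlated Gaussian is only an illustrative choice of distribution:

```python
import numpy as np

# Monte Carlo illustration of V[X] = E[X X^T] - E[X] E[X]^T for N = 2.
rng = np.random.default_rng(1)
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
X = rng.multivariate_normal(mean, cov, size=200000)  # one sample per row

E_X = X.mean(axis=0)          # estimate of E[X]
E_XXT = X.T @ X / len(X)      # estimate of E[X X^T]
V = E_XXT - np.outer(E_X, E_X)
print(V)                      # close to cov
```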
Part I - 1-2. Conditional Probability
Simultaneous Probability Density Function
Definition. Let (X,Y) be an R^M × R^N-valued random variable which has a probability density function p(x,y), where x = (x_1, x_2, ..., x_M) and y = (y_1, y_2, ..., y_N). Then p(x,y) is called the simultaneous probability density function of (X,Y). The simultaneous PDF shows the PDF of the pair (x,y).
Marginal Probability Density Function
Definition. Let (X,Y) be an R^M × R^N-valued random variable which has a simultaneous probability density function p(x,y). The marginal probability density functions p(x) and p(y) of X and Y are respectively defined by
p(x) = ∫ p(x,y) dy, p(y) = ∫ p(x,y) dx.
The marginal PDF shows the PDF of each of x and y.
Example.4. A simultaneous probability density function on R^1 × R^1:
p(x,y) = (1/C) exp(-2x^2 + 2xy - y^2),
where C = ∫∫ exp(-2x^2 + 2xy - y^2) dx dy = π.
The marginal density functions are
p(x) = ∫ (1/C) exp(-2x^2 + 2xy - y^2) dy = (1/π^{1/2}) exp(-x^2),
p(y) = ∫ (1/C) exp(-2x^2 + 2xy - y^2) dx = (1/(2π)^{1/2}) exp(-y^2/2).
Formula: for a > 0, ∫ exp(-a x^2) dx = (π/a)^{1/2}.
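A short numerical check of Example.4 (assuming scipy): integrating the joint density over y at a fixed x reproduces the marginal p(x):

```python
import numpy as np
from scipy.integrate import quad

# Example.4: p(x,y) = (1/pi) exp(-2x^2 + 2xy - y^2).  Integrating out y at a
# fixed x should give the marginal p(x) = (1/pi^(1/2)) exp(-x^2).
joint = lambda x, y: np.exp(-2.0 * x**2 + 2.0 * x * y - y**2) / np.pi

x0 = 0.8
px, _ = quad(lambda y: joint(x0, y), -np.inf, np.inf)
print(px, np.exp(-x0**2) / np.sqrt(np.pi))  # both approximately 0.2975
```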
Example.5. A simultaneous probability density function on (x,y) in R^1 × {0,1}:
p(x,0) = a p_1(x), p(x,1) = b p_2(x),
where p_1(x) and p_2(x) are probability density functions and a + b = 1.
The marginal probability density function of x is p(x) = a p_1(x) + b p_2(x).
The marginal probability function of y is p(0) = a, p(1) = b.
Conditional Probability Density Function
Definition. Let (X,Y) be an R^M × R^N-valued random variable which has a simultaneous probability density function p(x,y). The conditional probability density functions p(y|x) and p(x|y) are respectively defined by
p(y|x) = p(x,y) / p(x), p(x|y) = p(x,y) / p(y).
Remark 1. For x such that p(x) = 0, p(y|x) is not defined.
Remark 2. (Mathematically advanced point) In a general probability space, the definition of conditional probability requires the division of measures, for example, the Radon-Nikodym derivative.
Meaning of the Conditional PDF
p(y|x) = p(x,y) / p(x) = p(x,y) / { ∫ p(x,y') dy' }.
The conditional PDF shows the PDF of y for a fixed x.
Example.6. A simultaneous probability density function on R^1 × R^1:
p(x,y) = (1/π) exp(-2x^2 + 2xy - y^2).
The marginal density functions are
p(x) = (1/π^{1/2}) exp(-x^2), p(y) = (1/(2π)^{1/2}) exp(-y^2/2).
The conditional probability density functions are
p(x|y) = p(x,y)/p(y) = (1/(π/2)^{1/2}) exp(-2(x - y/2)^2),
p(y|x) = p(x,y)/p(x) = (1/π^{1/2}) exp(-(y - x)^2).
Formula: for a > 0, ∫ exp(-a x^2) dx = (π/a)^{1/2}.
Example.7. A simultaneous probability density function on (x,y) in R^1 × {0,1}:
p(x,0) = a p_1(x), p(x,1) = b p_2(x).
The marginal probability density functions are
p(x) = a p_1(x) + b p_2(x), p(0) = a, p(1) = b.
The conditional probability density functions are
p(x|0) = p(x,0)/p(0) = p_1(x),
p(x|1) = p(x,1)/p(1) = p_2(x),
p(0|x) = p(x,0)/p(x) = a p_1(x) / (a p_1(x) + b p_2(x)),
p(1|x) = p(x,1)/p(x) = b p_2(x) / (a p_1(x) + b p_2(x)).
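Example.7 translates directly into code. In the sketch below, the weights a, b and the component densities p1, p2 are illustrative choices:

```python
import numpy as np

# Example.7 in code: p(0|x) = a p1(x) / (a p1(x) + b p2(x)).
def normal_pdf(x, mu, s):
    return np.exp(-(x - mu)**2 / (2.0 * s**2)) / (s * np.sqrt(2.0 * np.pi))

a, b = 0.3, 0.7                      # marginal probabilities of y, a + b = 1
p1 = lambda x: normal_pdf(x, -1.0, 1.0)
p2 = lambda x: normal_pdf(x, 2.0, 1.0)

x = 0.5
post0 = a * p1(x) / (a * p1(x) + b * p2(x))
print(post0, 1.0 - post0)            # p(0|x) and p(1|x); they sum to 1
```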
Bayes' Theorem
Theorem (Bayes' theorem). p(x,y) = p(y|x) p(x) = p(x|y) p(y).
Note. If p(x) = 0, then p(y|x) is not defined, but in that case we define 0 · p(y|x) = 0.
This theorem is obtained automatically from the definition of the conditional probability, but it has many applications to real-world information processing.
Part I - 1-3. Supervised Learning Sumio Watanabe
Supervised Learning
[Figure: a teacher shows a student examples together with the answers (e.g. handwritten digits labeled 8, 6, 2); the student learns to read the characters.]
Mathematical Description
An information source q(x) generates the examples X_1, X_2, ..., X_n, and a teacher q(y|x) generates the answers Y_1, Y_2, ..., Y_n. The student p(y|x,w) optimizes the parameter w so that p(y|x,w) approximates q(y|x).
True and Estimated Distributions
The training data X_1, X_2, ..., X_n and Y_1, Y_2, ..., Y_n are generated from the true distribution q(x) q(y|x) = q(x,y); the learning machine p(y|x,w) provides the estimated conditional distribution of Y given X.
Supervised Learning
Training data X_1, X_2, ..., X_n and Y_1, Y_2, ..., Y_n are drawn from an unknown information source q(x,y). The trained learning machine p(y|x,w) is then evaluated on test data (X, Y) from the same source.
Definition of Supervised Learning
Definition. In supervised learning, an information source and a teacher are represented by q(x) and q(y|x), whereas a learning machine is represented by p(y|x,w) with a parameter w. A set of training data consists of {(x_i, y_i); i = 1, 2, ..., n}, which are independent and have the distribution q(x)q(y|x). The number n is called the number of training data. In statistics, a set of training data is called a sample and n is referred to as the sample size. A learning machine optimizes the parameter w so that p(y|x,w) approximates q(y|x).
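A minimal sketch of this data-generating process, with q(x) and q(y|x) chosen only for illustration (the definition leaves them arbitrary):

```python
import numpy as np

# Sketch of the data-generating process: (x_i, y_i) drawn from q(x) q(y|x).
# Here q(x) is uniform on [-3, 3] and q(y|x) is sin(x) plus Gaussian noise;
# both are illustrative assumptions, not part of the definition.
rng = np.random.default_rng(2)
n = 100                                        # sample size
X = rng.uniform(-3.0, 3.0, size=n)             # X_i ~ q(x)
Y = np.sin(X) + 0.1 * rng.standard_normal(n)   # Y_i ~ q(y|X_i)
training_data = list(zip(X, Y))                # {(x_i, y_i); i = 1, ..., n}
print(training_data[0])
```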
Supervised Learning
Supervised learning is mathematically understood as an approximation of q(y|x) by p(y|x,w).
[Figure: two plots of y against x, showing the model p(y|x,w) being fitted to the true conditional distribution q(y|x).]
Example of q(x)q(y|x)
Training data are taken from q(x): images of the characters 0 and 6. The teacher q(y|x) supplies the correct label, 0 or 6.
Neural Network: an Example of p(y|x,w)
The learning machine is a neural network with 25 input units (a 25-pixel image), 6 hidden units, and 2 output units (one for each of the classes 0 and 6).
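A sketch of the forward pass of a network with this shape; the tanh and softmax activations are assumptions, since the slide fixes only the layer sizes:

```python
import numpy as np

# Forward pass of a network with the slide's shape: 25 inputs, 6 hidden
# units, 2 outputs.  tanh and softmax are assumed activations.
rng = np.random.default_rng(3)
W1, b1 = 0.1 * rng.standard_normal((6, 25)), np.zeros(6)
W2, b2 = 0.1 * rng.standard_normal((2, 6)), np.zeros(2)

def forward(x):                    # x: 25-dimensional image vector
    h = np.tanh(W1 @ x + b1)       # hidden layer
    z = W2 @ h + b2
    e = np.exp(z - z.max())        # softmax, numerically stabilized
    return e / e.sum()             # p(y|x,w) over the 2 classes

print(forward(rng.uniform(0.0, 1.0, size=25)))
```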
Classification
[Figure: classification of the training data, n = 100; inputs pass through the input, hidden, and output layers, and the output is compared with the desired output.]
Learning in a Neural Network
[Figure: the data, the true distribution, and the output of the trained neural network.]
Contents of Part I
1. Basic Concepts in Statistical Learning
2. Neural Network
3. Learning in Neural Network, Report Writing (1)
4. Boltzmann Machine
5. Deep Learning
6. Information and Entropy, Report Writing (2)
7. Prediction Accuracy
8. Knowledge Discovery, Report Writing (3)