Statistical Learning Theory

Part I : Mathematical Learning Theory (1-8), by Sumio Watanabe. Evaluation : Report.
Part II : Information Statistical Mechanics (9-15), by Yoshiyuki Kabashima. Evaluation : Report.

Prerequisite Knowledge (1)

In order to follow this lecture, you need:
(1) Vector spaces, linear transforms, and matrix computation.
(2) Partial differentiation and multiple integration, e.g. ∂f(x,y)/∂x and ∫∫ f(x,y) dx dy.
(3) Basic probability theory, e.g. a probability function p(1), p(2), p(3), ... on a set S.

Prerequisite Knowledge (2)

Statistical learning theory needs mathematics. If you have not learned even one of the following check points, it will be impossible for you to understand this lecture.

Check 1. Let f(x) be a C²-class function of x = (x_1, x_2, ..., x_N) in R^N. Then, using the definitions ∇f(x) = (∂f/∂x_i) and ∇²f(x) = (∂²f/∂x_i ∂x_j), for a given a there exists a* in R^N (on the segment between a and x) such that
f(x) = f(a) + ((x-a), ∇f(a)) + (1/2) ((x-a), ∇²f(a*)(x-a)).

Check 2. Let X_1, X_2, ..., X_n be independently and identically distributed random variables which have the finite expectation value M. Then (X_1 + X_2 + ... + X_n)/n converges to M almost surely as n tends to infinity.

Remark. If you do not know these check points, you should learn them in an undergraduate program before taking this lecture.
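As a quick numerical illustration of Check 2 (this sketch is not part of the original slides; the exponential distribution is an arbitrary choice), the running average of i.i.d. samples approaches the true expectation value M:

    import numpy as np

    rng = np.random.default_rng(0)
    M = 0.3                                          # true expectation value
    samples = rng.exponential(scale=M, size=100000)  # i.i.d. samples with mean M
    running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)
    for n in (10, 100, 1000, 100000):
        print(n, running_mean[n - 1])                # approaches M = 0.3 as n grows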

Part I Mathematical Learning Theory Sumio Watanabe

Part I - 1. Basic Concepts in Statistical Learning Theory Sumio Watanabe

Part I - 1-1. Probability Distribution and Random Variable

Probability Density Function on R

Definition. Let R be the set of all real values. A function p from R to R is called a probability density function if
(1) for an arbitrary x in R, p(x) ≥ 0, and
(2) ∫ p(x) dx = 1.

Example 1. Standard Normal Distribution

p(x) = (1/(2π)^{1/2}) exp(-x²/2).

Formula: for a > 0, ∫ exp(-a x²) dx = (π/a)^{1/2}.
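A quick numerical check of this formula (an illustrative sketch, not part of the slides), using a simple Riemann sum on a fine grid:

    import numpy as np

    x = np.linspace(-10.0, 10.0, 200001)
    dx = x[1] - x[0]
    for a in (0.5, 1.0, 2.0):
        numeric = np.sum(np.exp(-a * x**2)) * dx
        print(a, numeric, np.sqrt(np.pi / a))        # the two values agree closely

    # The standard normal density integrates to 1.
    p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print(np.sum(p) * dx)                            # approximately 1.0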

Example 2. Uniform Distribution on [a, b]

p(x) = 1/(b-a)  (a ≤ x ≤ b),
     = 0        (otherwise).

Probability Distribution on R

Definition. Let p(x) be a probability density function on R. For a subset A contained in R, P(A) is defined by
P(A) = ∫_A p(x) dx.
Then P is called a probability distribution on R.

Example 3.

Probability density function
p(x) = 2x  (0 ≤ x ≤ 1),
     = 0   (otherwise).

P([0.5, 0.7]) = ∫_{0.5}^{0.7} 2x dx = 0.24.
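An illustrative sketch (not in the slides): the distribution function of this density is F(x) = x² on [0, 1], so samples can be drawn as X = U^{1/2} with U uniform on [0, 1], and the fraction of samples landing in [0.5, 0.7] estimates P([0.5, 0.7]) = 0.7² - 0.5² = 0.24.

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(size=1_000_000)
    x = np.sqrt(u)                       # inverse-CDF sampling from p(x) = 2x on [0, 1]
    estimate = np.mean((0.5 <= x) & (x <= 0.7))
    print(estimate)                      # close to 0.24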

Remark. Probability and the Axiom of Choice

This page explains a mathematically advanced point; a student studying introductory probability theory may skip it. From the mathematical point of view, the axiom of choice is inconsistent with the assumption that every subset of R is measurable. Hence mathematical probability theory that employs the axiom of choice needs to specify which subsets are measurable; the family of such subsets is called a completely additive class. A student who wants to understand this should learn measure theory and mathematical probability theory.

Probability Density Function on R^N

Definition. Let N be a positive integer and R^N be the N-dimensional real Euclidean space. A function p from R^N to R is called a probability density function if
(1) for an arbitrary x in R^N, p(x) ≥ 0, and
(2) ∫ p(x) dx = 1.
A probability distribution P(A) is defined for a subset A in R^N by
P(A) = ∫_A p(x) dx.

Random Variable

Definition. Let P be a probability distribution on R^N. If a variable X in R^N satisfies
P({X in A}) = ∫_A p(x) dx,
then X is called an R^N-valued random variable, and P and p are called the probability distribution and density function of X, respectively. We also say that X has P and X has p.

Note. In probability theory, a random variable is defined as a measurable function on a probability space.

Expectation Value and Variance

Definition. Assume that X is an R^N-valued random variable which has a probability density function p. Then the expectation value E[X] and covariance matrix V[X] are respectively defined by
E[X] = ∫ x p(x) dx,
V[X] = ∫ (x - E[X])(x - E[X])^T p(x) dx
     = E[(X - E[X])(X - E[X])^T]
     = E[X X^T] - E[X] E[X]^T,
where ^T denotes the transposed vector. If N = 1, then V[X] is called the variance.
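A small sample-based check of the identity V[X] = E[XX^T] - E[X]E[X]^T (an illustrative sketch; the 2-dimensional Gaussian below is an arbitrary choice, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    mean = np.array([1.0, -2.0])
    cov = np.array([[2.0, 0.5], [0.5, 1.0]])
    x = rng.multivariate_normal(mean, cov, size=200000)       # samples of X in R^2

    e_x = x.mean(axis=0)                                      # E[X]
    v_direct = (x - e_x).T @ (x - e_x) / x.shape[0]           # E[(X-E[X])(X-E[X])^T]
    v_identity = x.T @ x / x.shape[0] - np.outer(e_x, e_x)    # E[XX^T] - E[X]E[X]^T
    print(np.allclose(v_direct, v_identity))                  # True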

Part I - 1-2. Conditional Probability

Simultaneous Probability Density Function

Definition. Let (X, Y) be an (R^M × R^N)-valued random variable which has a probability density function p(x,y), where x = (x_1, x_2, ..., x_M) and y = (y_1, y_2, ..., y_N). Then p(x,y) is called the simultaneous (joint) probability density function of (X, Y).

[Figure: surface plot of p(x,y) over the (x, y) plane.] The simultaneous PDF shows the PDF of the pair (x, y).

Marginal Probability Density Function

Definition. Let (X, Y) be an (R^M × R^N)-valued random variable which has a simultaneous probability density function p(x,y). The marginal probability density functions p(x) and p(y) of X and Y are respectively defined by
p(x) = ∫ p(x,y) dy,
p(y) = ∫ p(x,y) dx.

[Figure: p(x,y) with its marginals p(x) and p(y).] The marginal PDF shows the PDF of x or y alone.

Example 4. A simultaneous probability density function on R¹ × R¹:
p(x,y) = (1/C) exp(-2x² + 2xy - y²),
where C = ∫∫ exp(-2x² + 2xy - y²) dx dy = π.

The marginal density functions are
p(x) = (1/C) ∫ exp(-2x² + 2xy - y²) dy = (1/π^{1/2}) exp(-x²),
p(y) = (1/C) ∫ exp(-2x² + 2xy - y²) dx = (1/(2π)^{1/2}) exp(-y²/2).

Formula: for a > 0, ∫ exp(-a x²) dx = (π/a)^{1/2}.
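An illustrative numerical check (not in the slides): integrating p(x,y) over y on a grid reproduces the closed-form marginal p(x) = (1/π^{1/2}) exp(-x²).

    import numpy as np

    y = np.linspace(-12.0, 12.0, 4001)
    dy = y[1] - y[0]
    for x in (-1.0, 0.0, 0.5, 2.0):
        joint = np.exp(-2 * x**2 + 2 * x * y - y**2) / np.pi   # p(x,y) with C = pi
        marginal_numeric = np.sum(joint) * dy                   # integrate p(x,y) over y
        marginal_closed = np.exp(-x**2) / np.sqrt(np.pi)
        print(x, marginal_numeric, marginal_closed)             # the two values agree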

Example 5. A simultaneous probability density function on (x,y) in R¹ × {0,1}:
p(x,0) = a p_1(x),
p(x,1) = b p_2(x),
where p_1(x) and p_2(x) are probability density functions and a + b = 1.

The marginal probability density function of x is
p(x) = a p_1(x) + b p_2(x).
The marginal probability function of y is
p(0) = a, p(1) = b.

Conditional Probability Density Function

Definition. Let (X, Y) be an (R^M × R^N)-valued random variable which has a simultaneous probability density function p(x,y). The conditional probability density functions p(y|x) and p(x|y) are respectively defined by
p(y|x) = p(x,y) / p(x),
p(x|y) = p(x,y) / p(y).

Remark 1. For x such that p(x) = 0, p(y|x) is not defined.
Remark 2. (Mathematically advanced point) In a general probability space, the definition of conditional probability requires the division of measures, for example the Radon-Nikodym derivative.

Meaning of the Conditional PDF

p(y|x) = p(x,y) / p(x) = p(x,y) / { ∫ p(x,y) dy }.

[Figure: slices of p(x,y) at fixed values of x.] The conditional PDF shows the PDF of y for a fixed x.

Example 6. A simultaneous probability density function on R¹ × R¹:
p(x,y) = (1/π) exp(-2x² + 2xy - y²).

The marginal density functions are
p(x) = (1/π^{1/2}) exp(-x²),
p(y) = (1/(2π)^{1/2}) exp(-y²/2).

The conditional probability density functions are
p(x|y) = p(x,y)/p(y) = (1/(π/2)^{1/2}) exp(-2(x - y/2)²),
p(y|x) = p(x,y)/p(x) = (1/π^{1/2}) exp(-(y - x)²).

Formula: for a > 0, ∫ exp(-a x²) dx = (π/a)^{1/2}.
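An illustrative sketch (not in the slides) that recomputes p(y|x) directly as p(x,y)/p(x) and compares it with the closed form above:

    import numpy as np

    x = 0.8                                                    # a fixed value of x
    y = np.linspace(-5.0, 5.0, 11)
    joint = np.exp(-2 * x**2 + 2 * x * y - y**2) / np.pi       # p(x,y)
    marginal_x = np.exp(-x**2) / np.sqrt(np.pi)                # p(x)
    conditional = joint / marginal_x                           # p(y|x) by definition
    closed_form = np.exp(-(y - x)**2) / np.sqrt(np.pi)         # (1/pi^{1/2}) exp(-(y-x)^2)
    print(np.allclose(conditional, closed_form))               # True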

Example 7. A simultaneous probability density function on (x,y) in R¹ × {0,1}:
p(x,0) = a p_1(x),
p(x,1) = b p_2(x).

The marginal probability density functions are
p(x) = a p_1(x) + b p_2(x),
p(0) = a, p(1) = b.

The conditional probability density functions are
p(x|0) = p(x,0)/p(0) = p_1(x),
p(x|1) = p(x,1)/p(1) = p_2(x),
p(0|x) = p(x,0)/p(x) = a p_1(x) / (a p_1(x) + b p_2(x)),
p(1|x) = p(x,1)/p(x) = b p_2(x) / (a p_1(x) + b p_2(x)).
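The conditional probabilities p(0|x) and p(1|x) above are exactly what a Bayes classifier computes. A minimal sketch, assuming for illustration that p_1 and p_2 are Gaussian densities with means -1 and +1 (these choices are not from the slides):

    import numpy as np

    def gaussian(x, mean, var):
        # Gaussian density with the given mean and variance.
        return np.exp(-(x - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    a, b = 0.3, 0.7                          # marginal probabilities p(0) = a, p(1) = b
    p1 = lambda x: gaussian(x, -1.0, 1.0)    # p_1(x), density for y = 0
    p2 = lambda x: gaussian(x, +1.0, 1.0)    # p_2(x), density for y = 1

    x = np.array([-2.0, 0.0, 2.0])
    posterior_0 = a * p1(x) / (a * p1(x) + b * p2(x))   # p(0|x)
    posterior_1 = b * p2(x) / (a * p1(x) + b * p2(x))   # p(1|x)
    print(posterior_0 + posterior_1)                    # each entry equals 1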

Bayes Theorem

Theorem (Bayes Theorem).
p(x,y) = p(y|x) p(x) = p(x|y) p(y).

Note. If p(x) = 0, then p(y|x) is not defined, but in that case we define 0 · p(y|x) = 0.

This theorem follows automatically from the definition of conditional probability, but it has many applications to real-world information processing.

Part I - 1-3. Supervised Learning Sumio Watanabe

Supervised Learning

[Figure: a teacher shows example handwritten characters together with the answers (8, 6, 2); a student learns to read the characters.]

Mathematical Description

[Figure: an information source q(x) generates the examples X_1, X_2, ..., X_n; the teacher q(y|x) produces the answers Y_1, Y_2, ..., Y_n; the student is a model p(y|x,w). Student: "I optimize the parameter w so that q(y|x) = p(y|x,w)."]
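A minimal sketch of this setting (the particular q(x) and q(y|x) below are illustrative assumptions, not the lecture's): training pairs are produced by drawing X_i from the information source q(x) and then Y_i from the teacher q(y|x_i).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100                                          # number of training data

    # Information source q(x): uniform on [-1, 1] (assumed for illustration).
    x = rng.uniform(-1.0, 1.0, size=n)

    # Teacher q(y|x): y = sin(pi*x) + Gaussian noise (assumed for illustration).
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=n)

    training_data = list(zip(x, y))                  # { (X_i, Y_i) ; i = 1, ..., n }
    print(training_data[:3])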

True and Estimated Distributions

[Figure: the training data X_1, ..., X_n and Y_1, ..., Y_n are generated from the true distribution q(x) q(y|x) = q(x,y); the learning machine p(y|x,w) estimates q(y|x) for a new pair (X, Y).]

Supervised Learning

[Figure: an unknown information source q(x,y) generates the training data X_1, ..., X_n and Y_1, ..., Y_n as well as the test data (X, Y); the learning machine p(y|x,w) is trained on the former and evaluated on the latter.]

Definition of Supervised Learning

Definition. In supervised learning, an information source and a teacher are represented by q(x) and q(y|x), whereas a learning machine is represented by p(y|x,w) with a parameter w. A set of training data consists of {(x_i, y_i); i = 1, 2, ..., n}, which are independently taken from q(x) q(y|x). The number n is called the number of training data. In statistics, a set of training data is called a sample and n is referred to as the sample size. A learning machine optimizes the parameter w so that p(y|x,w) approximates q(y|x).
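One standard way to optimize w so that p(y|x,w) approximates q(y|x) is maximum likelihood. A minimal sketch, assuming a simple regression model p(y|x,w) = N(y; w_0 + w_1 x, σ²) (an illustrative choice, not the lecture's learning machine):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 0.5 + 2.0 * x + 0.1 * rng.normal(size=n)     # training data from q(x) q(y|x)

    # For p(y|x,w) = N(y; w0 + w1*x, sigma^2), maximizing the likelihood in w
    # is equivalent to least squares, so the optimal w has a closed form.
    X = np.column_stack([np.ones(n), x])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)        # w = (w0, w1)
    print(w)                                         # close to (0.5, 2.0)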

Supervised Learning

Supervised learning is mathematically understood as an approximation of q(y|x) by p(y|x,w).

[Figure: plots of q(y|x) and p(y|x,w) over the (x, y) plane.]

Example of q(x) q(y|x)

[Figure: training data are taken from q(x) q(y|x); example images of the characters 0 and 6 with their labels.]

Neural Network: Example of p(y|x,w)

[Figure: the learning machine is a neural network with 25 input units (the image pixels), 6 hidden units, and 2 output units (for the classes 0 and 6).]
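A minimal sketch of the forward pass of such a network (layer sizes 25-6-2 as on the slide; the tanh/softmax choices and random weights are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = 0.1 * rng.normal(size=(6, 25)), np.zeros(6)   # input (25) -> hidden (6)
    W2, b2 = 0.1 * rng.normal(size=(2, 6)), np.zeros(2)    # hidden (6) -> output (2)

    def forward(x):
        # Compute p(y|x,w) over the two output classes.
        h = np.tanh(W1 @ x + b1)                 # hidden unit activations
        z = W2 @ h + b2
        return np.exp(z) / np.sum(np.exp(z))     # softmax output

    x = rng.uniform(size=25)                     # one 25-pixel input image
    print(forward(x))                            # two class probabilities summing to 1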

Classification

[Figure: training data, n = 100; the network's input layer, hidden layer, and output layer, together with the desired output.]

Learning in a Neural Network

[Figure: the data, the true distribution, and the trained neural network.]

Contents of Part I

1. Basic Concepts in Statistical Learning
2. Neural Network
3. Learning in Neural Network, Report Writing (1)
4. Boltzmann Machine
5. Deep Learning
6. Information and Entropy, Report Writing (2)
7. Prediction Accuracy
8. Knowledge Discovery, Report Writing (3)