
Today: Probability and Statistics; Naïve Bayes Classification; Linear Algebra: Matrix Multiplication, Matrix Inversion; Calculus: Vector Calculus, Optimization, Lagrange Multipliers.

Classical Artificial Intelligence: expert systems, theorem provers, Shakey, chess. Largely characterized by determinism.

Modern Artificial Intelligence: fingerprint ID, internet search, vision (facial ID, object recognition), speech recognition, Asimo, Jeopardy! Statistical modeling to generalize from data.

Two Caveats about Statistical Modeling: Black Swans and the Long Tail.

Black Swans: In the 17th century, all known swans were white; based on that evidence, it seemed impossible for a swan to be anything other than white. In the 18th century, black swans were discovered in Western Australia. Black swans are rare, sometimes unpredictable events that have extreme impact. Almost all statistical models underestimate the likelihood of unseen events.

The Long Tail: Many events follow an exponential distribution. These distributions have a very long tail, i.e., a large region that carries significant probability mass even though the likelihood at any particular point is low. Often, interesting events occur in the long tail, but it is difficult to accurately model behavior in this region.

Boxes and Balls: Two boxes, one red and one blue. Each contains colored balls.

Boxes and Balls: Suppose we randomly select a box, then randomly draw a ball from that box. The identity of the box is a random variable, B. The identity of the ball is a random variable, L. B can take two values, r (red) or b (blue); L can take two values, g (green) or o (orange).

Boxes and Balls: Given some information about B and L, we want to ask questions about the likelihood of different events. What is the overall probability of drawing a green ball? If I drew an orange ball, what is the probability that I drew from the blue box?

Some basics: The probability (or likelihood) of an event is the fraction of times that the event occurs out of n trials, as n approaches infinity. Probabilities lie in the range [0, 1]. Mutually exclusive events are events that cannot occur simultaneously; the likelihoods of a set of mutually exclusive events that covers all outcomes must sum to 1. If two events are independent, then p(X, Y) = p(X) p(Y) and p(X | Y) = p(X).

Joint Probability, P(X, Y): A joint probability function defines the likelihood of two (or more) events occurring together. Counts for the boxes-and-balls example:

             Orange  Green  Total
  Blue box      1      3      4
  Red box       6      2      8
  Total         7      5     12

Let n_ij be the number of times event i and event j simultaneously occur. Then p(X = x_i, Y = y_j) = n_ij / N.

Generalizing the Joint Probability: with counts n_ij, the row totals are r_i = Σ_j n_ij, the column totals are c_j = Σ_i n_ij, and Σ_i Σ_j n_ij = N.

Marginalization: Consider the probability of X irrespective of Y: p(X = x_j) = c_j / N. The number of instances in column j is the sum of the instances in each cell of that column, c_j = Σ_{i=1}^{L} n_ij. Therefore, we can marginalize, or sum, over Y: p(X = x_j) = Σ_{i=1}^{L} p(X = x_j, Y = y_i).

Conditional Probability: Consider only the instances where X = x_j. The fraction of these instances where Y = y_i is the conditional probability, the probability of y given x: p(Y = y_i | X = x_j) = n_ij / c_j.

Relating the Joint, Conditional and Marginal: p(X = x_i, Y = y_j) = n_ij / N = (n_ij / c_i)(c_i / N) = p(Y = y_j | X = x_i) p(X = x_i), where c_i is the number of instances with X = x_i.
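
To make the count-based definitions above concrete, here is a minimal sketch (assuming NumPy is available) that builds the joint, the marginals, and the conditional from the boxes-and-balls counts and checks the product rule numerically.

```python
import numpy as np

# Counts n_ij from the table above: rows = box (blue, red), columns = ball (orange, green).
n = np.array([[1, 3],
              [6, 2]])
N = n.sum()                                   # 12 draws in total

joint = n / N                                 # p(Box = i, Ball = j) = n_ij / N
p_box = joint.sum(axis=1)                     # marginalize over the ball colour
p_ball = joint.sum(axis=0)                    # marginalize over the box

cond = n / n.sum(axis=1, keepdims=True)       # p(Ball = j | Box = i) = n_ij / c_i

# Product rule check: p(Box, Ball) = p(Ball | Box) * p(Box)
assert np.allclose(joint, cond * p_box[:, None])
print(p_box)                                  # [1/3, 2/3]
print(p_ball)                                 # [7/12, 5/12]
```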

Sum and Product Rules: In general, we'll refer to a distribution over a random variable as p(X) and the distribution evaluated at a particular value as p(x). Sum Rule: p(X) = Σ_Y p(X, Y). Product Rule: p(X, Y) = p(Y | X) p(X).

Bayes Rule: p(Y | X) = p(X | Y) p(Y) / p(X).

Interpretation of Bayes Rule: in p(Y | X) = p(X | Y) p(Y) / p(X), the left-hand side is the posterior, p(X | Y) is the likelihood, and p(Y) is the prior. Prior: the information we have before observation. Posterior: the distribution of Y after observing X. Likelihood: the likelihood of observing X given Y.

Boxes and Balls with Bayes Rule: Assume I'm inherently more likely to select the red box (66.6%) than the blue box (33.3%). If I selected an orange ball, what is the likelihood that I selected the red box? The blue box?

Boxes and Balls: p(B = r | L = o) = p(L = o | B = r) p(B = r) / p(L = o) = (6/8 × 2/3) / (7/12) = 6/7. p(B = b | L = o) = p(L = o | B = b) p(B = b) / p(L = o) = (1/4 × 1/3) / (7/12) = 1/7.
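
A small sketch in plain Python, using the priors and likelihoods from this slide, reproduces the 6/7 and 1/7 posteriors:

```python
# Priors and likelihoods taken from the slides above.
p_box = {"red": 2/3, "blue": 1/3}
p_orange_given_box = {"red": 6/8, "blue": 1/4}

# Evidence: p(L = o) = sum over boxes of p(L = o | B) p(B) = 7/12
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)

# Bayes rule: p(B | L = o)
posterior = {b: p_orange_given_box[b] * p_box[b] / p_orange for b in p_box}
print(posterior)   # {'red': 0.857... (= 6/7), 'blue': 0.142... (= 1/7)}
```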

Naïve Bayes Classification: This is a simple instance of a Bayesian classification approach. Here the box is the class, and the colored ball is a feature, or observation. We can extend this approach to incorporate more independent features.

Naïve Bayes Classification: Some theory first. c* = argmax_c p(c | x_1, x_2, ..., x_n) = argmax_c p(x_1, x_2, ..., x_n | c) p(c) / p(x_1, x_2, ..., x_n). The naïve (independence) assumption: p(x_1, x_2, ..., x_n | c) = p(x_1 | c) p(x_2 | c) ... p(x_n | c).

Naïve Bayes Classification: Assuming independent features simplifies the math: c* = argmax_c p(x_1 | c) p(x_2 | c) ... p(x_n | c) p(c) / p(x_1, x_2, ..., x_n) = argmax_c p(x_1 | c) p(x_2 | c) ... p(x_n | c) p(c), since the denominator does not depend on c.

Naïve Bayes Example Data:

  HOT   LIGHT  SOFT  RED
  COLD  HEAVY  SOFT  RED
  HOT   HEAVY  FIRM  RED
  HOT   LIGHT  FIRM  RED
  COLD  LIGHT  SOFT  BLUE
  COLD  HEAVY  SOFT  BLUE
  HOT   HEAVY  FIRM  BLUE
  HOT   LIGHT  FIRM  BLUE
  HOT   HEAVY  FIRM  ?

c* = argmax_c p(x_1 | c) p(x_2 | c) ... p(x_n | c) p(c)

Naïve Bayes Example Data (same table as above). Prior: p(c = red) = 0.5, p(c = blue) = 0.5.

Naïve Bayes Example Data (same table as above). Likelihoods: p(hot | c = red) = 0.75, p(hot | c = blue) = 0.5; p(heavy | c = red) = 0.5, p(heavy | c = blue) = 0.5; p(firm | c = red) = 0.5, p(firm | c = blue) = 0.5.

Naïve Bayes Example Data (same table as above): p(hot | c = red) p(heavy | c = red) p(firm | c = red) p(c = red) = 0.75 × 0.5 × 0.5 × 0.5 = 0.09375; p(hot | c = blue) p(heavy | c = blue) p(firm | c = blue) p(c = blue) = 0.5 × 0.5 × 0.5 × 0.5 = 0.0625. The query HOT, HEAVY, FIRM is therefore classified as RED.
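
The same computation can be written as a short count-based sketch (plain Python; the dataset below is the toy table from these slides):

```python
# Toy dataset from the slides: (temperature, weight, texture, class).
data = [
    ("HOT",  "LIGHT", "SOFT", "RED"),
    ("COLD", "HEAVY", "SOFT", "RED"),
    ("HOT",  "HEAVY", "FIRM", "RED"),
    ("HOT",  "LIGHT", "FIRM", "RED"),
    ("COLD", "LIGHT", "SOFT", "BLUE"),
    ("COLD", "HEAVY", "SOFT", "BLUE"),
    ("HOT",  "HEAVY", "FIRM", "BLUE"),
    ("HOT",  "LIGHT", "FIRM", "BLUE"),
]
query = ("HOT", "HEAVY", "FIRM")

scores = {}
for c in {row[-1] for row in data}:
    rows = [r for r in data if r[-1] == c]
    score = len(rows) / len(data)                            # prior p(c)
    for i, value in enumerate(query):                        # naive independence assumption
        score *= sum(r[i] == value for r in rows) / len(rows)
    scores[c] = score
print(scores)   # {'RED': 0.09375, 'BLUE': 0.0625} -> predict RED
```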

Continuous Probabilities: So far, X has been discrete, able to take one of M values. What if X is continuous? Now p(x) is a continuous probability density function. The probability that x will lie in an interval (a, b) is p(x ∈ (a, b)) = ∫_a^b p(x) dx.
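
As an illustration of this definition, the following sketch (plain Python; a standard normal density is assumed purely as an example p(x)) approximates p(x ∈ (−1, 1)) with a simple Riemann sum and compares it to the exact value:

```python
import math

def pdf(x):
    # Standard normal density, used here only as an example p(x).
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

a, b, n = -1.0, 1.0, 100_000
width = (b - a) / n
approx = sum(pdf(a + (i + 0.5) * width) for i in range(n)) * width
exact = 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))
print(approx, exact)   # both about 0.6827
```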

Continuous probability example (figure).

Properties of probability density functions: p(x) ≥ 0 and ∫ p(x) dx = 1. Sum Rule: p(x) = ∫ p(x, y) dy. Product Rule: p(x, y) = p(y | x) p(x).

Expected Values: Given a random variable with a distribution p(X), what is the expected value of X? Discrete: E[x] = Σ_x p(x) x. Continuous: E[x] = ∫ p(x) x dx.

Multinomial Distribution: If a variable x can take 1-of-K states, we represent the distribution of this variable as a multinomial distribution. The probability of x being in state k is μ_k, with Σ_{k=1}^{K} μ_k = 1, and p(x; μ) = Π_{k=1}^{K} μ_k^{x_k}.

Expected Value of a Multinomial: The expected value is the vector of means: E[x; μ] = Σ_x p(x; μ) x = (μ_1, μ_2, ..., μ_K)^T.
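
A quick numerical check of this fact (assuming NumPy; with a single trial, rng.multinomial produces 1-of-K one-hot samples):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])
rng = np.random.default_rng(0)
samples = rng.multinomial(1, mu, size=100_000)   # each row is a one-hot vector x
print(samples.mean(axis=0))                      # empirical E[x], close to mu
print(mu)
```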

Gaussian Distribution. One dimension: N(x; μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)). D dimensions: N(x; μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^T Σ^{-1} (x − μ)).
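
A minimal sketch of evaluating the D-dimensional density from this formula (assuming NumPy; the mean and covariance below are arbitrary example values):

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Evaluate N(x; mu, cov) for a D-dimensional Gaussian."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

mu = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
print(gaussian_density(np.array([0.5, -1.0]), mu, cov))
```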

Gaussians (figure).

How machine learning uses statistical modeling. Expectation: the expected value of a function is the hypothesis. Variance: the variance is the confidence in that hypothesis.

Variance: The variance of a random variable describes how much variability there is around the expected value. It is calculated as the expected squared error: var[f] = E[(f(x) − E[f(x)])²] = E[f(x)²] − E[f(x)]².

Covariance: The covariance of two random variables expresses how they vary together: cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[xy] − E[x]E[y]. If two variables are independent, their covariance is zero.
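
The identities above are easy to check empirically; a minimal sketch (assuming NumPy) with one dependent and one independent pair of variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)   # depends on x
z = rng.normal(size=100_000)           # independent of x

# var[x] = E[x^2] - E[x]^2
print(np.mean(x**2) - np.mean(x)**2, np.var(x))
# cov[x, y] = E[xy] - E[x]E[y]: large for the dependent pair, near zero for the independent one
print(np.mean(x * y) - np.mean(x) * np.mean(y))   # about 2
print(np.mean(x * z) - np.mean(x) * np.mean(z))   # about 0
```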

Linear Algebra. Vectors: a one-dimensional array; if not specified, assume x is a column vector, x = (x_0, x_1, ..., x_{n-1})^T. Matrices: higher-dimensional arrays, typically denoted with capital letters; an n-row-by-m-column matrix A has entries a_{i,j} for rows i = 0, ..., n−1 and columns j = 0, ..., m−1.

Transposition: Transposing swaps columns and rows. The column vector x = (x_0, x_1, ..., x_{n-1})^T becomes the row vector x^T = (x_0, x_1, ..., x_{n-1}).

Transposition: Transposing a matrix swaps columns and rows. If A is n-by-m with entries a_{i,j}, then A^T is m-by-n with entries (A^T)_{i,j} = a_{j,i}.

Addition: Two matrices can be added iff they have the same dimensions. If A and B are both n-by-m matrices, then (A + B)_{i,j} = a_{i,j} + b_{i,j}.

Multiplication: To multiply two matrices, the inner dimensions must agree. An n-by-m matrix A can be multiplied by an m-by-p matrix B to give the n-by-p matrix AB = C, with c_{ij} = Σ_{k=0}^{m-1} a_{ik} b_{kj}.
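
A sketch of this element-wise definition (NumPy is assumed only to check the explicit loop against a reference result):

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)     # 2-by-3
B = np.arange(12.0).reshape(3, 4)    # 3-by-4, so the product C is 2-by-4

C = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        # c_ij = sum_k a_ik * b_kj over the shared inner dimension
        C[i, j] = sum(A[i, k] * B[k, j] for k in range(A.shape[1]))

assert np.allclose(C, A @ B)
```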

Inversion: The inverse of an n-by-n (square) matrix A is denoted A^{-1} and has the property A A^{-1} = I, where I, the identity matrix, is the n-by-n matrix with ones along the diagonal: I_{ij} = 1 if i = j and 0 otherwise.

Identity Matrix: Matrices are invariant under multiplication by the identity matrix: AI = A and IA = A.

Helpful matrix inversion properties: (A^{-1})^{-1} = A; (kA)^{-1} = k^{-1} A^{-1}; (A^T)^{-1} = (A^{-1})^T; (AB)^{-1} = B^{-1} A^{-1}.
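
These identities can be verified numerically; a minimal sketch (assuming NumPy; a random 3-by-3 matrix is invertible with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
k = 2.5
inv = np.linalg.inv

assert np.allclose(inv(inv(A)), A)                 # (A^-1)^-1 = A
assert np.allclose(inv(k * A), inv(A) / k)         # (kA)^-1 = k^-1 A^-1
assert np.allclose(inv(A.T), inv(A).T)             # (A^T)^-1 = (A^-1)^T
assert np.allclose(inv(A @ B), inv(B) @ inv(A))    # (AB)^-1 = B^-1 A^-1
```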

Norm: The norm of a vector x represents its Euclidean length: ||x|| = √(Σ_{i=0}^{n-1} x_i²) = √(x_0² + x_1² + ... + x_{n-1}²).

Positive Definiteness. Quadratic form: scalar case, c_0 + c_1 x + c_2 x²; vector case, x^T A x. A matrix M is positive definite if x^T M x > 0 for all nonzero x, and positive semi-definite if x^T M x ≥ 0 for all x.
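
For a symmetric matrix, positive definiteness can be checked via its eigenvalues; a small sketch (assuming NumPy; the example matrix is arbitrary):

```python
import numpy as np

M = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

# A symmetric M is positive definite iff all of its eigenvalues are strictly positive.
eigvals = np.linalg.eigvalsh(M)
print(eigvals, np.all(eigvals > 0))   # [1. 3.] True

# Spot check of the quadratic form x^T M x for one nonzero x.
x = np.array([0.7, -0.3])
print(x @ M @ x)                      # a positive number
```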

Calculus: Derivatives and Integrals; Optimization.

Derivatives: The derivative of a function gives its slope at a point x, written d/dx f(x) or f'(x).

Derivative example (figure).

Integrals: Integration is the inverse operation of differentiation (up to a constant): ∫ f(x) dx = F(x) + c, where F'(x) = f(x). Graphically, an integral can be viewed as the area under the curve defined by f(x).

Integration example (figure).

Vector Calculus: differentiation with respect to a matrix or vector; the gradient; change of variables with a vector.

Derivative w.r.t. a vector: Given a vector x and a function f(x) : R^n → R, how can we find f'(x)?

Derivative w.r.t. a vector: For f(x) : R^n → R, ∂f(x)/∂x = (∂f(x)/∂x_0, ∂f(x)/∂x_1, ..., ∂f(x)/∂x_{n-1})^T.

Example Derivation: f(x) = x_0 + 4 x_1 x_2, so ∂f(x)/∂x_0 = 1, ∂f(x)/∂x_1 = 4 x_2, ∂f(x)/∂x_2 = 4 x_1.

Example Derivation: For f(x) = x_0 + 4 x_1 x_2, ∂f(x)/∂x = (∂f/∂x_0, ∂f/∂x_1, ∂f/∂x_2)^T = (1, 4x_2, 4x_1)^T. This is also referred to as the gradient of the function, written ∇f(x) or ∇f.
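
A finite-difference check of this hand-derived gradient (assuming NumPy; the test point is arbitrary):

```python
import numpy as np

def f(x):
    return x[0] + 4 * x[1] * x[2]

def grad_f(x):
    # Hand-derived gradient from the slide: (1, 4*x2, 4*x1)
    return np.array([1.0, 4 * x[2], 4 * x[1]])

x = np.array([0.3, -1.2, 2.0])
eps = 1e-6
numeric = np.array([
    (f(x + eps * np.eye(3)[i]) - f(x - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(numeric)     # central differences
print(grad_f(x))   # should match closely
```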

Useful vector calculus identities. Scalar (dot) products: ∂(x^T a)/∂x = a and ∂(a^T x)/∂x = a. Product rule: ∂(AB)/∂x = (∂A/∂x) B + A (∂B/∂x). Also ∂(x^T A)/∂x = A and ∂(Ax)/∂x = A^T.

Useful vector calculus identities. Derivative of an inverse: ∂(A^{-1})/∂x = −A^{-1} (∂A/∂x) A^{-1}. Change of variable: ∫ f(x) dx = ∫ f(u) (∂x/∂u) du.

Optimization: We have an objective function f(x) that we'd like to maximize or minimize. Set the first derivative to zero and solve.

Optimization with constraints: What if I want to constrain the parameters of the model, e.g., require that the mean is less than 10? Find the best likelihood subject to a constraint. Two functions: an objective function to maximize, and a constraint that must be satisfied.

Lagrange Multipliers: Find the maxima of f(x, y) subject to a constraint, e.g., maximize f(x, y) = x + 2y subject to x² + y² = 1.

General form: maximize f(x, y) subject to g(x, y) = c. Introduce a new variable λ and find a maximum of Λ(x, y, λ) = f(x, y) + λ(g(x, y) − c).

Example: maximize f(x, y) = x + 2y subject to x² + y² = 1. Introduce a new variable λ and find a maximum of Λ(x, y, λ) = x + 2y + λ(x² + y² − 1).

Example: ∂Λ(x, y, λ)/∂x = 1 + 2λx = 0, ∂Λ(x, y, λ)/∂y = 2 + 2λy = 0, ∂Λ(x, y, λ)/∂λ = x² + y² − 1 = 0. We now have 3 equations in 3 unknowns.

Example: Eliminate λ. From the first two equations, 1 = −2λx and 2 = −2λy, so 1/x = −2λ = 2/y, giving y = 2x. Substitute and solve: x² + y² = 1 becomes x² + (2x)² = 1, so 5x² = 1, x = ±1/√5, y = ±2/√5, with the maximum at the positive solution.
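
The result can be sanity-checked by scanning the constraint set directly; a small sketch (assuming NumPy) that parameterizes the unit circle and maximizes x + 2y over it:

```python
import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 1_000_000)
x, y = np.cos(theta), np.sin(theta)        # every (x, y) satisfies x^2 + y^2 = 1
best = np.argmax(x + 2 * y)
print(x[best], y[best])                    # approx (0.447, 0.894)
print(1 / np.sqrt(5), 2 / np.sqrt(5))      # the Lagrange solution (1/sqrt(5), 2/sqrt(5))
```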

Why does Machine Learning need these tools? Calculus: integration allows the marginalization of continuous probability density functions. Optimization: we need to identify the maximum likelihood or the minimum risk. Linear Algebra: many features lead to high-dimensional spaces; vectors and matrices allow us to compactly describe and manipulate high-dimensional feature spaces.

Why does Machine Learning need these tools? Vector Calculus: all of the optimization needs to be performed in high-dimensional spaces, optimizing multiple variables simultaneously (e.g., gradient descent), and we want to take marginals over high-dimensional distributions like Gaussians.

Next Time: Linear Regression and Regularization. Read Chapters 1.1, 3.1, and 3.3.