Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Loreen Bradley
6 years ago

Fall 3 Computer Vision
Overview of Statistical Tools
Haibin Ling

Statistical Inference
  [Diagram: Observation, Prior knowledge -> inference -> Decision]

Bayesian Framework
  $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
  - $P(A \mid B)$: Posterior Probability
  - $P(A)$: Prior Probability
  $P(\text{Model} \mid \text{Observation}) = \frac{P(\text{Observation} \mid \text{Model})\, P(\text{Model})}{P(\text{Observation})}$

Modeling
  Notation
  - $M$ for model, $M_i$ is the i-th model
  - $O = (O_1, \ldots, O_N)$ are N observations
  Generative model
  - Model the prior information
  - Model $P(M)$ and/or $P(O \mid M)$
  - Learning or estimating model (parameters)
  - Ignore observation probability, e.g., classification: $P(\text{Observation} \mid \text{Model})\, P(\text{Model})$
  - Assume equal model probability, maximum likelihood: $P(\text{Observation} \mid \text{Model})$
  Discriminative model
  - Model the posterior probability directly: $P(M \mid O)$
  - Regression problem

Very simple case
  - Observation O is a one-dimensional random variable (r.v.), e.g., height of a person (H)
  - Output (model) is a one-dimensional value, e.g., gender (G)
  - Graphical representation: [Figure: node G linked to node H, labeled $P(H \mid G)$]

Things are usually more complicated
  - Observation O is a sequence $\{O_t\}$: voice signals of a sentence ($V = \{V_1, \ldots, V_N\}$)
  - Output (model) is also a sequence, e.g., words in the sentence ($W = \{W_1, \ldots, W_N\}$)
  - Graphical representation: [Figure: nodes $V_1, V_2, \ldots, V_{t-1}, V_t, V_{t+1}, \ldots, V_{N-1}, V_N$ and $W_1, W_2, \ldots, W_{t-1}, W_t, W_{t+1}, \ldots, W_{N-1}, W_N$, labeled $P(V \mid W)$]
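
The "very simple case" (gender G from height H) can be sketched in a few lines of Python. This is only an illustration of the Bayesian framework: the Gaussian form of P(H | G), the numbers in `PRIOR` and `LIKELIHOOD`, and all function names are assumptions made for the example, not taken from the slides.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical model parameters (illustrative only, not from the slides).
PRIOR = {"female": 0.5, "male": 0.5}                          # P(G)
LIKELIHOOD = {"female": (162.0, 7.0), "male": (176.0, 7.0)}   # (mean, std) of P(H | G), in cm


def posterior(height):
    """Return P(G | H = height) for every G using Bayes' rule."""
    joint = {g: gaussian_pdf(height, *LIKELIHOOD[g]) * PRIOR[g] for g in PRIOR}
    evidence = sum(joint.values())  # P(H), the normalizing constant
    return {g: p / evidence for g, p in joint.items()}


def classify(height):
    """Pick the G maximizing P(H | G) P(G); P(H) is the same for all G and is ignored."""
    return max(PRIOR, key=lambda g: gaussian_pdf(height, *LIKELIHOOD[g]) * PRIOR[g])


print(posterior(170.0))
print(classify(180.0))
```

With equal priors, as assumed here, `classify` reduces to the maximum likelihood decision mentioned on the slide.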

Hidden Markov Model
  [Figure: chain of hidden nodes $W_1, W_2, \ldots, W_{t-1}, W_t, W_{t+1}, \ldots, W_{N-1}, W_N$ with observed nodes $V_1, V_2, \ldots, V_{t-1}, V_t, V_{t+1}, \ldots, V_{N-1}, V_N$]
  $P(V \mid W) = \prod_i P(V_i \mid W_i)$
  - Markov property: conditional independence
  - Efficient inference: dynamic programming solution, polynomial time complexity

Tasks can be harder
  - Observation O is not sequential: an image with noise ($I = \{I_{i,j}\}$)
  - Output is also multi-dimensional, e.g., recovered image ($R = \{R_{i,j}\}$)
  [Figure: noisy image $I_{i,j}$ and recovered image $R_{i,j}$. Images from Geman & Geman]

Tasks can be harder
  - Graphical representation: [Figure: 5 x 5 grid of nodes $I_{11}, \ldots, I_{55}$]
  - Markov property
  - Inference: posterior inference, energy minimization
  - In general, no efficient method finds the global optimum
  - Approximate solutions: Gibbs sampling, etc.
  - Along these lines, conditional random fields, etc.

General graphical model
  - $G = \langle V, E \rangle$
  - Random variables: nodes (V)
  - Dependency: edges (E)
  - A general rule: $\Pr(V) = \Pr(v_1 \mid pa(v_1))\, \Pr(v_2 \mid pa(v_2)) \cdots \Pr(v_N \mid pa(v_N))$
  - HMM, MRF are very special cases
  - Tree structure is usually good, with efficient solutions
  - The inference in general is NP-hard

Back to Bayesian Framework
  $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
  - $P(A \mid B)$: Posterior Probability
  - $P(A)$: Prior Probability
  $P(\text{Model} \mid \text{Observation}) = \frac{P(\text{Observation} \mid \text{Model})\, P(\text{Model})}{P(\text{Observation})}$
  - Supervised Learning: classification with labeled examples.
  - Discriminative modeling: go for the target directly!
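
The dynamic programming solution mentioned for the Hidden Markov Model is commonly realized as the Viterbi algorithm. The sketch below is one such realization under stated assumptions: the slides do not name the algorithm, and the two states, three observation symbols, and all probabilities are hypothetical values chosen for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden state sequence, its joint probability)."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: hidden states A and B, observed symbols x, y, z.
STATES = ["A", "B"]
START = {"A": 0.6, "B": 0.4}
TRANS = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
EMIT = {"A": {"x": 0.5, "y": 0.4, "z": 0.1}, "B": {"x": 0.1, "y": 0.3, "z": 0.6}}

print(viterbi(["x", "y", "z"], STATES, START, TRANS, EMIT))
```

The two nested loops over states inside the loop over time give a cost proportional to N times the square of the number of states, which is the polynomial time complexity the slide refers to.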

Linearly Separable Classes
  - Which one to choose?
  - What is the best criterion?

Support Vector Machine
  - Maximize the minimum distance from hyperplane to points.
  - Points at this minimum distance are support vectors.

Why SVM
  - Why margin? Classification error bound is negatively related to margin
  - Why linear? VC-dimension: classification error bound is related to VC-dimension
  - What about non-linear cases? Kernel

Tough cases
  - Not linearly separable!
  - Each data point has a class label: $y_t = +1$ or $y_t = -1$

Kernel techniques (w/ SVM)
  - Kernel mapping
    - Map to high dimensional feature spaces
    - Linearly separable in the high dimensional spaces
    - The mapping is implicit
  - Efficient: convex optimization
  - Requirement for the kernel: symmetric positive definite
  Examples modified from A. Torralba
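
The idea that a kernel maps data implicitly into a space where it becomes linearly separable can be illustrated with a small Python sketch. The XOR-style data, the degree-2 polynomial kernel, the explicit feature map `phi`, and the hand-picked hyperplane `w`, `b` are all assumptions chosen for the example; no SVM is trained here.

```python
import math


def dot(a, b):
    """Inner product of two equal-length tuples."""
    return sum(u * v for u, v in zip(a, b))


def phi(x):
    """Explicit feature map whose inner product equals the kernel below."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """k(x, z) = (x . z)^2, evaluated without ever forming phi(x) or phi(z)."""
    return dot(x, z) ** 2


# XOR-style data: no single line in the (x1, x2) plane separates the labels.
POINTS = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
LABELS = [+1, +1, -1, -1]

# Hand-picked (not learned) hyperplane in the 3-D feature space.
w, b = (0.0, 1.0, 0.0), 0.0


def predict(x):
    """Linear decision rule applied in feature space."""
    return 1 if dot(w, phi(x)) + b > 0 else -1


print([predict(p) for p in POINTS])
print(poly_kernel((1.0, 2.0), (3.0, -1.0)), dot(phi((1.0, 2.0)), phi((3.0, -1.0))))
```

The second print compares the kernel value with the explicit inner product in feature space; a kernel method only ever needs the former, which is why the mapping can stay implicit.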

Boosting
  $f(x) = \sum_{m=1}^{M} c_m f_m(x)$
  - $f$: strong classifier
  - $f_m$: weak classifiers
  - $c_m$: coefficients/weights
  [Figure: weak classifiers $f_1$, $f_2$, $f_3$, $f_4$]
  The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers.

Statistical Analysis
  [Diagram: Observation, Prior knowledge -> Analysis -> Understanding]

Supervised vs Unsupervised
  - Supervised: with labels, e.g., classification (Blue or Red?)
  - Unsupervised: no labels, e.g., cluster (Which cluster?)
  - Semi-supervised: some labels, e.g., cluster, classification, ...

Clustering: k-means, initialization
  - K=3
  - Cluster center
  Revised from M. Jamali et al.
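
The weighted combination of weak classifiers can be sketched in Python as follows. The weak classifiers and the weights here are set by hand purely for illustration (a boosting algorithm such as AdaBoost would learn them from data), and the names `stump`, `constant`, `WEAK`, and `WEIGHTS` are hypothetical.

```python
def stump(axis, threshold, direction):
    """Weak linear classifier: returns direction if x[axis] > threshold, else -direction."""
    def f(x):
        return direction if x[axis] > threshold else -direction
    return f


def constant(value):
    """Degenerate weak classifier that always answers value (acts as a bias term)."""
    def f(x):
        return value
    return f


# Hand-set weak classifiers f_m and weights c_m (assumed, not learned).
WEAK = [
    stump(0, 1.0, +1),   # votes +1 when x1 > 1
    stump(0, 3.0, -1),   # votes +1 when x1 <= 3
    stump(1, 1.0, +1),   # votes +1 when x2 > 1
    stump(1, 3.0, -1),   # votes +1 when x2 <= 3
    constant(-1),
]
WEIGHTS = [1.0, 1.0, 1.0, 1.0, 3.0]


def strong_score(x):
    """f(x) = sum over m of c_m * f_m(x)."""
    return sum(c * f(x) for c, f in zip(WEIGHTS, WEAK))


def strong_classify(x):
    """Sign of the weighted vote."""
    return 1 if strong_score(x) > 0 else -1


print(strong_classify((2.0, 2.0)), strong_classify((0.0, 2.0)))
```

Each weak classifier is a single axis-aligned threshold, yet the combination is positive only inside a box, a region no single linear classifier can describe.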

Clustering: k-means, update mean
  - K=3
  - Cluster mean

More about k-means
  - No guarantee to find a global optimum
    - Local optimum is found
    - Multiple (random) initializations, pick the best
  - Sensitive to noise
  - Choice of k?
    - Potential problem for applying k-means
    - Analyst may have a priori knowledge of k
  - Other clustering algorithms
    - Mean shift
    - Dirichlet process
    - Message passing

Factor Analysis
  Why factor analysis
  - Better classification, e.g., easy for building classifiers
  - Feature analysis/selection, e.g., finding the dominant components with large variation
  - Dimension reduction, e.g., for space and speed efficiency

Principal Component Analysis (PCA)
  - Motivation: find new orthogonal basis that better explains data variation
  - Intuitive steps
    - Shift the original basis to the data mean
    - Rotate the basis to align with the directions of variations
    - [optional] Use anisotropic scales for each basis

Principal Component Analysis (PCA)
  Example application: classification
  - Classifier in the original basis $X = (x_1, x_2)$:
    $f(X) = \begin{cases} + & a x_1 + b x_2 + c > 0 \\ - & \text{otherwise} \end{cases}$
  - After PCA projection $Y = (y_1, y_2)$:
    $f(Y) = \begin{cases} + & y_1 > 0 \\ - & \text{otherwise} \end{cases}$
  [Figure: data plotted in the $(x_1, x_2)$ axes with the rotated $(y_1, y_2)$ basis]

Independent Component Analysis (ICA)
  - Motivation
    - What if the data looks like this?
    - Is orthogonal basis reasonable?
    - What about the variation now?
  - Solution
    - Seeking independent components
    - Not necessarily capture data variation
    - Not necessarily orthogonal
  - Limitation
    - No closed form solution
    - Often stuck in local optimum
    - Slow
  [Figure: data in the $(x_1, x_2)$ axes with non-orthogonal components $y_1$, $y_2$]
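
The k-means steps (initialization, assignment, mean update) and the advice to try multiple random initializations and pick the best can be sketched as below. This is a minimal illustration, assuming squared Euclidean distance, initialization from randomly chosen data points, and hypothetical function names and parameters (`restarts`, `seed`, `max_iter`).

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from a random initialization; returns (centers, cost)."""
    centers = rng.sample(points, k)  # initialization: k data points as centers
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to its cluster mean
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged, possibly to a local optimum only
        centers = new_centers
    cost = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, cost


def kmeans(points, k, restarts=10, seed=0):
    """Run k-means from several random initializations and keep the lowest cost."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[1])


DATA = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
        (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]

centers, cost = kmeans(DATA, k=3)
print(sorted(centers), cost)
```

A single run can stop with two centers sharing one group of points, which is the local optimum the slide warns about; the restarts make that outcome less likely but do not rule it out.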

Non-linear cases
  - Real data are often non-linear
  - Kernel methods
  - Manifold learning
  Source: Roweis & Saul 2000

Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted