CSIE/GINM, NTU 2009/11/30
1 Introduction to Machine Learning (Part 1: Statistical Machine Learning). Shou-de Lin, CSIE/GINM, NTU, 2009/11/30
2 Syllabus of an Intro ML course (Machine Learning, Andrew Ng, Stanford, Autumn 2009). Supervised learning (7 classes): supervised learning setup, LMS, logistic regression, perceptron, exponential family, generative learning algorithms, Gaussian discriminant analysis, Naive Bayes, support vector machines, model selection and feature selection, ensemble methods (bagging, boosting, ECOC), evaluating and debugging learning algorithms. Learning theory (3 classes): bias/variance tradeoff, union and Chernoff/Hoeffding bounds, VC dimension, worst-case (online) learning, practical advice on how to use learning algorithms. Unsupervised learning (5 classes): clustering, K-means, EM, mixture of Gaussians, factor analysis, PCA, MDS, pPCA, independent components analysis (ICA). Reinforcement learning and control (4 classes): MDPs, Bellman equations, value iteration and policy iteration, linear quadratic regulation (LQR), LQG, Q-learning, value function approximation, policy search, REINFORCE, POMDPs. HT has done a great job teaching you Advanced SL and Learning Theory, and my mission is to fill one missing piece in the puzzle.
3 Why teach an Intro to ML? When you reveal that you have taken an ML course, people will more or less expect you to already know something, e.g. Naive Bayes. There are some ML methods so commonly applied in research and the real world that you need to know a little bit about them, e.g. K-means clustering. And there are some ML methods that are too unbelievable and amazing to ignore, e.g. the EM framework.
4 To Bring You Back to Earth. Statistical machine learning (2 hours): a Bayesian view of ML; generative learning models; Gaussian discriminant analysis; Naive Bayes. Unsupervised learning (3 hours): clustering, K-means, EM. Reinforcement learning (0.5 hour): value iteration and policy iteration; Q-learning and SARSA.
5 Theoretical ML vs. Statistical ML. What you already know: SL takes many (x, t) pairs as input to train a learner f(x), then applies it to an unseen x_k and predicts f(x_k). For example (x is 3-dimensional): training {([1,2,3], 0.1), ([2,3,4], 0.2), ([3,4,5], 0.5)}; testing: [2,4,5] → 0.7. However, uncertainty exists in the real world, so an error distribution (e.g. Gaussian) is usually added: t = f(x) + error. That is, the same input can generate different outputs, for example: training {([1,2,3], 0.1), ([1,2,3], 0.2), ([1,2,3], 0.1)}; testing: [1,2,3] → ?
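The noisy-target idea above can be sketched in a few lines of Python; the underlying function f and the precision value below are invented for illustration, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical underlying function of a 3-dimensional input
    return 0.1 * x.sum()

x = np.array([1.0, 2.0, 3.0])
beta = 100.0                    # precision = 1/variance of the error term
sigma = 1.0 / np.sqrt(beta)

# the same input x generates different targets because of the noise term:
# t = f(x) + error, with error ~ N(0, 1/beta)
targets = np.array([f(x) + rng.normal(0.0, sigma) for _ in range(3)])
```

Each draw gives a slightly different t for the identical input, which is exactly why the output must be modeled as a distribution rather than a single value.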
6 The Probabilistic Form of t. The output t is a distribution caused by the (assumed Gaussian) error term: p(t|x, w, β) = N(t | y(x,w), β⁻¹). β is called the precision parameter and equals the inverse of the variance, 1/σ².
7 The SL Process under Probability. Given training data {X, T}, we want to determine the unknown parameters W and β, so that we know the distribution of t. Assuming we observed N data points, the likelihood function is p(T|X,W,β) = p(t_1|x_1,W,β) · … · p(t_N|x_N,W,β) = Π_{n=1}^N N(t_n | y(x_n,W), β⁻¹), and its logarithm, called the log-likelihood function, is ln p(T|X,W,β) = −(β/2) Σ_{n=1}^N {y(x_n,W) − t_n}² + (N/2) ln β − (N/2) ln(2π).
8 Maximum Likelihood Estimation (MLE). Idea: adjust the unknown parameters (i.e. W and β) to maximize the likelihood, or equivalently the log-likelihood, ln p(T|X,W,β) = −(β/2) Σ_{n=1}^N {y(x_n,W) − t_n}² + (N/2) ln β − (N/2) ln(2π). Adjusting W to maximize this log-likelihood under a Gaussian error model is equivalent to finding the W_ML that minimizes the mean-square-error function.
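As a minimal sketch of this equivalence: for a model that is linear in its parameters, the W_ML that maximizes the Gaussian likelihood is just the least-squares solution. The data-generating coefficients below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic training data: y(x, W) = w0 + w1*x, targets with Gaussian noise
x = rng.uniform(-1, 1, size=50)
t = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=50)

Phi = np.column_stack([np.ones_like(x), x])   # design matrix

# W_ML minimizes the sum-of-squares error, which under Gaussian noise
# is the same as maximizing the (log-)likelihood
W_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```

With enough data the recovered coefficients approach the true [2.0, 3.0].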
9 Maximum Likelihood Estimation for β. First we calculate W_ML, which governs the mean of the distribution. Then we plug W_ML into the likelihood function to determine the optimal β_ML: setting ∂/∂β ln p(T|X,W_ML,β) = 0 gives 1/β_ML = (1/N) Σ_{n=1}^N {y(x_n,W_ML) − t_n}².
10 An SL System Using MLE. 1. First determine W as the W_ML that minimizes the error function (1/2) Σ_{n=1}^N {y(x_n,w) − t_n}² (this tends to overfit). 2. Use W_ML to find β via 1/β_ML = (1/N) Σ_{n=1}^N {y(x_n,W_ML) − t_n}². 3. Prediction stage: use W_ML and β_ML to construct the distribution of t: p(t|x,W_ML,β_ML) = N(t | y(x,W_ML), β_ML⁻¹). 4. Predict the value for an input x by sampling t from the distribution in (3). The MLE approach consistently underestimates the variance of the data and can lead to overfitting.
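The four steps above can be sketched end to end, again assuming a linear model; the target function and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# step 0: synthetic noisy training data from a hypothetical linear target
x = rng.uniform(-1, 1, size=200)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=200)
Phi = np.column_stack([np.ones_like(x), x])

# step 1: W_ML minimizes the sum-of-squares error
W_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# step 2: 1/beta_ML is the mean squared residual around y(x, W_ML)
residuals = t - Phi @ W_ml
beta_ml = 1.0 / np.mean(residuals ** 2)

# steps 3-4: predict a new x by sampling from N(t | y(x, W_ML), 1/beta_ML)
x_new = 0.5
mean = W_ml[0] + W_ml[1] * x_new
sample = rng.normal(mean, 1.0 / np.sqrt(beta_ml))
```

Here 1/sqrt(beta_ml) recovers roughly the true noise level 0.5, and repeated sampling in step 4 yields the spread of plausible targets at x_new.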
11 Bayesian Approach for Regression. Why a Bayesian approach: some w's are preferable to others. For example, regularization prefers simple models (i.e. small w's). Consequently, p(w) cannot be treated as uniformly distributed.
12 Bayes Rule Review. P(W|X,T) = P(T|X,W) · P(W|X) / P(T|X) ∝ P(T|X,W) · P(W|X). P(W|X) is the prior probability; P(T|X,W) is the likelihood probability (what MLE tries to optimize: argmax_W P(T|X,W)); P(W|X,T) is the posterior probability.
13 Bayesian Curve Fitting. P(W|X,T) ∝ P(T|X,W) · P(W|X). Likelihood probability (we have already done this): ln p(T|X,W,β) = −(β/2) Σ_{n=1}^N {y(x_n,W) − t_n}² + (N/2) ln β − (N/2) ln(2π). Prior: assume W is independent of X and Gaussian with mean 0 and variance 1/α: p(W|X) = (α/2π)^{(M+1)/2} exp(−(α/2) WᵀW). Then the log posterior is proportional to −(β/2) Σ_{n=1}^N {y(x_n,W) − t_n}² + (N/2)(ln β − ln 2π) + ((M+1)/2)(ln α − ln 2π) − (α/2) WᵀW.
14 Maximum Posterior Estimation (MAP). The best parameter set should maximize the posterior probability instead of the likelihood: −(β/2) Σ_{n=1}^N {y(x_n,W) − t_n}² + (N/2)(ln β − ln 2π) + ((M+1)/2)(ln α − ln 2π) − (α/2) WᵀW. The MAP solution for Gaussian noise and a Gaussian prior is to find the W that minimizes (β/2) Σ_{n=1}^N {y(x_n,W) − t_n}² + (α/2) WᵀW. Maximizing the posterior distribution is therefore equivalent to minimizing the regularized sum-of-squares error function with regularization parameter λ = α/β.
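A minimal sketch of the MAP solution for a linear model: with λ = α/β, W_MAP is the ridge-regression solution of the regularized normal equations. The data and hyperparameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic data from a hypothetical linear target
x = rng.uniform(-1, 1, size=50)
t = 3.0 * x + rng.normal(0.0, 0.1, size=50)
Phi = np.column_stack([np.ones_like(x), x])

alpha, beta = 1.0, 100.0
lam = alpha / beta            # regularization parameter lambda = alpha/beta

# W_MAP minimizes (beta/2)||Phi W - t||^2 + (alpha/2)||W||^2,
# i.e. the regularized sum-of-squares (ridge regression) solution
W_map = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ t)
```

Because of the Gaussian prior, W_MAP is shrunk toward zero relative to the pure-MLE (least-squares) solution.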
15 What We Have Discussed So Far. 1. Learning phase (MLE or MAP): find the W_ML that maximizes the likelihood function p(T|X,W), i.e. the W that minimizes the squared-error loss function; or find the W_MAP that maximizes the posterior function P(W|T,X), i.e. the W that minimizes the regularized sum-of-squares loss function. 2. Inference phase: when a new x comes in, use the determined W to predict the output y.
16 Potential Issues. The problem of MLE: overfitting. The problem of MAP: it loses information, since the full posterior P(W|X,T) is collapsed to the single point W_MAP. (Figure: three posteriors P(W|X,T) with the same W_MAP but different shapes.) Since in MAP we have learned P(W|X,T), why not use the theorem of total probability: p(t|x,X,T) = ∫ p(t|x,W) · p(W|X,T) dW, where p(t|x,W) = N(t | y(x,W), β⁻¹).
17 The Predictive Distribution of t. p(t|x,X,T) = ∫ p(t|x,W) · p(W|X,T) dW, where p(t|x,W) = N(t | y(x,W), β⁻¹). It can be proved that when the posterior and p(t|x,W) are Gaussian, the predictive distribution p(t|x,X,T) is also Gaussian, with mean m(x) and variance s²(x).
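For a linear-in-parameters model with a Gaussian prior, m(x) and s²(x) have a closed form: the posterior over W is N(m_N, S_N) with S_N = (αI + βΦᵀΦ)⁻¹ and m_N = βS_NΦᵀT, giving m(x) = m_Nᵀφ(x) and s²(x) = 1/β + φ(x)ᵀS_Nφ(x). The sketch below uses these standard Bayesian linear-regression formulas; the cubic basis, target function, and hyperparameters are invented:

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic curve-fitting data
x = rng.uniform(-1, 1, size=30)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=30)
Phi = np.column_stack([x ** k for k in range(4)])   # cubic polynomial basis

alpha, beta = 0.5, 25.0   # assumed prior precision and noise precision

# posterior over W is Gaussian: N(W | m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predictive(x_new):
    """Mean m(x) and variance s^2(x) of p(t | x, X, T)."""
    phi = np.array([x_new ** k for k in range(4)])
    mean = m_N @ phi
    var = 1.0 / beta + phi @ S_N @ phi   # noise + parameter uncertainty
    return mean, var

mean, var = predictive(0.0)
```

Note that s²(x) is always strictly larger than the noise floor 1/β: the extra term φᵀS_Nφ is the uncertainty contributed by not knowing W exactly.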
18 Example of a Predictive Distribution. Green: true function. Red line: mean of the predicted function. Red zone: one variance from the mean.
19 y(x,w) from Sampling the Posterior Distribution over w.
20 The Benefit of Statistical Learning. It can produce not only the output but also the distribution of the outputs. The distribution tells us more about the data, including how confident the system is about its prediction. It can also be used to generate the dataset.
21 We have talked about regression, so how about classification?
22 Two Classification Strategies. Strategy 1: two-stage methods. Classification is broken down into two stages. Inference stage: for each C_k, use its own training data to learn a model for p(C_k|x). Decision stage: use p(C_k|x) and the loss matrix to make the optimal class assignment. Strategy 2: one-shot methods (or discriminant models). Use all training data to learn a function that directly maps inputs x to the output class.
23 Two Models for Strategy 1 (1/2). Model 1: Generative model. First solve the inference problem of determining p(x|C_k) for each class C_k individually. Separately infer the prior class probabilities p(C_k). Use Bayes' theorem to find the posterior class probabilities p(C_k|x) = p(x|C_k) · p(C_k) / p(x); note that the denominator can be computed as p(x) = Σ_k p(x|C_k) p(C_k). Finally, use p(C_k|x) and decision theory to find the best class assignment. This is called a generative model since we can learn p(x) and p(C_k, x).
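The generative-model steps above can be sketched directly; the two 1-D class-conditional Gaussians and the priors below are invented toy numbers:

```python
from math import exp, pi, sqrt

def gauss(x, mu, var):
    # density of a 1-D Gaussian, playing the role of p(x | C_k)
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

priors = {"C1": 0.7, "C2": 0.3}                 # p(C_k), inferred separately
params = {"C1": (0.0, 1.0), "C2": (3.0, 1.0)}   # (mean, variance) per class

def posterior(x):
    # Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x),
    # with the denominator p(x) = sum_k p(x | C_k) p(C_k)
    joint = {k: gauss(x, *params[k]) * priors[k] for k in priors}
    p_x = sum(joint.values())
    return {k: v / p_x for k, v in joint.items()}

post = posterior(0.0)
```

At x = 0 the point sits on C1's mean, so p(C1|x) dominates; the posteriors always sum to one because of the p(x) normalizer.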
24 Two Approaches for Strategy 1 (2/2). Model 2: Discriminative model. Directly learn p(C_k|x) from the data, knowing nothing about p(x|C_k) and p(x). Logistic regression is a typical example.
25 Classification Models. Generative model: learn P(C_k|x) using Bayes' rule. First solve the inference problem of determining p(x|C_k) and p(C_k) for each class C_k individually, then use Bayes' rule to find the posterior class probabilities p(C_k|x). Discriminative model: learn P(C_k|x) directly from the data, then apply decision theory to decide which C_k is the best assignment for x (e.g. logistic regression). Discriminant model: learn a function that directly maps inputs x to the output class. Linear discriminant functions: learn linear functions to separate the classes, e.g. least squares, Fisher's linear discriminant, the perceptron algorithm.
26 Generative vs. Discriminative Models. Generative model. Pros: p(x) can be used to generate samples of inputs, which is useful for knowledge discovery and data mining (e.g. outlier detection and novelty detection). Cons: very demanding, since it has to find the joint distribution of C_k and x; it needs a lot of training data. Discriminative model. Pros: can be learned with less data. Cons: cannot learn the detailed structure of the data.
27 Generative vs. Discriminant Models (1/3). A discriminant approach learns a discriminant function and uses it for decision making; it does not learn P(C_k|x). However, P(C_k|x) is useful in many respects. 1. It can be combined with the cost function to produce the final decision; if the cost function changes, we do not need to retrain the whole model as a discriminant model does. 2. It can be used to determine the reject region, e.g. P(C_HT|x) = 0.1, P(C_PJ|x) = 0.05 versus P(C_HT|x) = 0.7, P(C_PJ|x) = 0.2.
28 Generative vs. Discriminant Models (2/3). A generative model takes care of the class prior P(y) explicitly. E.g. in cancer prediction, only a small amount of the data (e.g. 0.1%) are positive; a normal classifier will simply guess negative and receive 99.9% accuracy. Using P(C_k|x) and P(C_k) allows us to ignore the interference from the prior during learning.
29 Generative vs. Discriminant Models (3/3). Generative models are better in terms of combining several models. Assume in the previous example we have two types of information for each photo: the image features (X_i) and the social information (X_s). It might be more effective and meaningful to build separate models P(C_k|X_i) and P(C_k|X_s) for these two sets of features. A generative model allows us to combine them: p(C_k|x_i,x_s) ∝ P(x_i,x_s|C_k) P(C_k) ≈ P(x_i|C_k) P(x_s|C_k) P(C_k) (by the naive Bayes assumption) ∝ P(C_k|x_i) P(C_k|x_s) / P(C_k).
30 Naive Bayes Assumption. Recall that in the Bayesian setup we have p(C_k|x) = p(x|C_k) p(C_k) / p(x). If we assume the features of an instance are independent given the class (conditionally independent), then P(X|C) = P(X_1, X_2, …, X_n | C) = Π_{i=1}^n P(X_i|C). Therefore we only need to know P(X_i|C) for each possible pair of a feature value and a class. If C and all X_i are binary, this requires specifying only 2n parameters: P(X_i=true|C=true) and P(X_i=true|C=false) for each X_i, with P(X_i=false|C) = 1 − P(X_i=true|C). Compare this to specifying on the order of 2^n parameters without any independence assumptions.
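A minimal sketch of a binary naive Bayes classifier that stores exactly the 2n conditional parameters; the tiny dataset is invented, and a real implementation would smooth the zero counts (e.g. Laplace smoothing):

```python
import numpy as np

# toy binary data: rows are instances, columns are n = 3 binary features
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])   # binary class labels

# the naive Bayes assumption needs only 2n conditional parameters:
# P(X_i = 1 | C = c) for each feature i and each class c
p_x_given_c = {c: X[y == c].mean(axis=0) for c in (0, 1)}
prior = {c: np.mean(y == c) for c in (0, 1)}

def predict(x):
    scores = {}
    for c in (0, 1):
        p = p_x_given_c[c]
        # product over features of P(x_i | c), times the class prior P(c)
        likelihood = np.prod(np.where(x == 1, p, 1 - p))
        scores[c] = likelihood * prior[c]
    return max(scores, key=scores.get)

pred = predict(np.array([1, 1, 0]))
```

Here `p_x_given_c[c]` holds the n per-feature probabilities for class c, so the whole model is 2n numbers plus the priors, versus an exponential table for the full joint.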
31 Gaussian Discriminant Analysis (GDA). This is another generative model. GDA assumes p(x|y) is distributed according to a multivariate normal distribution (MND). An MND in n dimensions is parameterized by a mean vector μ ∈ Rⁿ and a covariance matrix Σ ∈ Rⁿˣⁿ, also written N(μ, Σ). Its density is p(x; μ, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x−μ)ᵀ Σ⁻¹ (x−μ)).
32 Examples of 2-D multivariate normal distributions with Σ = I, Σ = 0.6I, and Σ = 2I.
33 The Model for GDA (1/2). p(x|y) is MND; p(y=0) = φ, p(y=1) = 1 − φ (assuming the different values of y share the same Σ). The log-likelihood of the data is the log of the joint likelihood, ℓ(φ, μ_0, μ_1, Σ) = log Π_i p(x^{(i)} | y^{(i)}; μ_0, μ_1, Σ) · p(y^{(i)}; φ).
34 The Model for GDA (2/2). Using the maximum likelihood estimate (MLE), we can obtain closed-form solutions: φ is the empirical class frequency, μ_0 and μ_1 are the per-class sample means of x, and Σ is the pooled sample covariance of x around its class mean.
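The closed-form estimates can be sketched on synthetic data; here the convention φ = p(y=1) is used, and the true class means and shared covariance are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# synthetic 2-D data: p(x | y) Gaussian with class-dependent mean, shared cov
n = 500
y = (rng.uniform(size=n) < 0.5).astype(int)
mu_true = np.array([[0.0, 0.0], [3.0, 3.0]])
X = mu_true[y] + rng.normal(0.0, 1.0, size=(n, 2))

# closed-form MLE of the GDA parameters
phi = y.mean()                                 # empirical p(y = 1)
mu0 = X[y == 0].mean(axis=0)                   # per-class sample means
mu1 = X[y == 1].mean(axis=0)
centered = X - np.where(y[:, None] == 1, mu1, mu0)
Sigma = centered.T @ centered / n              # pooled shared covariance
```

No iterative optimization is needed: unlike logistic regression, the GDA likelihood is maximized by these sample statistics directly.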
35 Discussion: GDA vs. Logistic Regression. In GDA, p(y|x) is of the form 1/(1+exp(−θᵀx)), where θ is a function of φ, Σ, and μ. This is exactly the form logistic regression uses to model p(y|x). That is, if p(x|y) is multivariate Gaussian, then p(y|x) follows a logistic function; however, the converse is not true. This implies that GDA makes stronger modeling assumptions about the data than LR does. Trained on the same dataset, the two algorithms will produce different decision boundaries. If p(x|y) is indeed Gaussian, GDA will get better results. In particular, if x is some sort of mean value of something whose size is not small, then by the central limit theorem GDA should perform very well. If p(x|y=1) and p(x|y=0) are both Poisson, then P(y|x) is still logistic; in this case LR can work better than GDA. If we are sure the data is non-Gaussian, we should use LR rather than GDA.
10-701/ Machine Learning Mid-term Exam Solution
0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it
More informationOutline. CSCI-567: Machine Learning (Spring 2019) Outline. Prof. Victor Adamchik. Mar. 26, 2019
Outlie CSCI-567: Machie Learig Sprig 209 Gaussia mixture models Prof. Victor Adamchik 2 Desity estimatio U of Souther Califoria Mar. 26, 209 3 Naive Bayes Revisited March 26, 209 / 57 March 26, 209 2 /
More informationRegression and generalization
Regressio ad geeralizatio CE-717: Machie Learig Sharif Uiversity of Techology M. Soleymai Fall 2016 Curve fittig: probabilistic perspective Describig ucertaity over value of target variable as a probability
More informationClustering. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar.
Clusterig CM226: Machie Learig for Bioiformatics. Fall 216 Sriram Sakararama Ackowledgmets: Fei Sha, Ameet Talwalkar Clusterig 1 / 42 Admiistratio HW 1 due o Moday. Email/post o CCLE if you have questios.
More informationNaïve Bayes. Naïve Bayes
Statistical Data Miig ad Machie Learig Hilary Term 206 Dio Sejdiovic Departmet of Statistics Oxford Slides ad other materials available at: http://www.stats.ox.ac.uk/~sejdiov/sdmml : aother plug-i classifier
More informationThe Bayesian Learning Framework. Back to Maximum Likelihood. Naïve Bayes. Simple Example: Coin Tosses. Given a generative model
Back to Maximum Likelihood Give a geerative model f (x, y = k) =π k f k (x) Usig a geerative modellig approach, we assume a parametric form for f k (x) =f (x; k ) ad compute the MLE θ of θ =(π k, k ) k=
More informationExpectation-Maximization Algorithm.
Expectatio-Maximizatio Algorithm. Petr Pošík Czech Techical Uiversity i Prague Faculty of Electrical Egieerig Dept. of Cyberetics MLE 2 Likelihood.........................................................................................................
More informationECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015
ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],
More informationOutline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression
REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio
More informationMixtures of Gaussians and the EM Algorithm
Mixtures of Gaussias ad the EM Algorithm CSE 6363 Machie Learig Vassilis Athitsos Computer Sciece ad Egieerig Departmet Uiversity of Texas at Arligto 1 Gaussias A popular way to estimate probability desity
More informationClassification with linear models
Lecture 8 Classificatio with liear models Milos Hauskrecht milos@cs.pitt.edu 539 Seott Square Geerative approach to classificatio Idea:. Represet ad lear the distributio, ). Use it to defie probabilistic
More information15-780: Graduate Artificial Intelligence. Density estimation
5-780: Graduate Artificial Itelligece Desity estimatio Coditioal Probability Tables (CPT) But where do we get them? P(B)=.05 B P(E)=. E P(A B,E) )=.95 P(A B, E) =.85 P(A B,E) )=.5 P(A B, E) =.05 A P(J
More information6.867 Machine learning
6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples
More informationBoosting. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 1, / 32
Boostig Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machie Learig Algorithms March 1, 2017 1 / 32 Outlie 1 Admiistratio 2 Review of last lecture 3 Boostig Professor Ameet Talwalkar CS260
More informationIntroduction to Artificial Intelligence CAP 4601 Summer 2013 Midterm Exam
Itroductio to Artificial Itelligece CAP 601 Summer 013 Midterm Exam 1. Termiology (7 Poits). Give the followig task eviromets, eter their properties/characteristics. The properties/characteristics of the
More informationExponential Families and Bayesian Inference
Computer Visio Expoetial Families ad Bayesia Iferece Lecture Expoetial Families A expoetial family of distributios is a d-parameter family f(x; havig the followig form: f(x; = h(xe g(t T (x B(, (. where
More informationECE 901 Lecture 12: Complexity Regularization and the Squared Loss
ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality
More information1 Review of Probability & Statistics
1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5
More informationStatistical Pattern Recognition
Statistical Patter Recogitio Classificatio: No-Parametric Modelig Hamid R. Rabiee Jafar Muhammadi Sprig 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Ageda Parametric Modelig No-Parametric Modelig
More informationLecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)
Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +
More informationLecture 11 and 12: Basic estimation theory
Lecture ad 2: Basic estimatio theory Sprig 202 - EE 94 Networked estimatio ad cotrol Prof. Kha March 2 202 I. MAXIMUM-LIKELIHOOD ESTIMATORS The maximum likelihood priciple is deceptively simple. Louis
More informationMachine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring
Machie Learig Regressio I Hamid R. Rabiee [Slides are based o Bishop Book] Sprig 015 http://ce.sharif.edu/courses/93-94//ce717-1 Liear Regressio Liear regressio: ivolves a respose variable ad a sigle predictor
More informationMATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4
MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.
More informationLecture 2: Monte Carlo Simulation
STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?
More informationLecture 4. Hw 1 and 2 will be reoped after class for every body. New deadline 4/20 Hw 3 and 4 online (Nima is lead)
Lecture 4 Homework Hw 1 ad 2 will be reoped after class for every body. New deadlie 4/20 Hw 3 ad 4 olie (Nima is lead) Pod-cast lecture o-lie Fial projects Nima will register groups ext week. Email/tell
More informationEECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1
EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 12
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig
More informationHypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance
Hypothesis Testig Empirically evaluatig accuracy of hypotheses: importat activity i ML. Three questios: Give observed accuracy over a sample set, how well does this estimate apply over additioal samples?
More informationMachine Learning Brett Bernstein
Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio
More informationIntroductory statistics
CM9S: Machie Learig for Bioiformatics Lecture - 03/3/06 Itroductory statistics Lecturer: Sriram Sakararama Scribe: Sriram Sakararama We will provide a overview of statistical iferece focussig o the key
More informationProbability and MLE.
10-701 Probability ad MLE http://www.cs.cmu.edu/~pradeepr/701 (brief) itro to probability Basic otatios Radom variable - referrig to a elemet / evet whose status is ukow: A = it will rai tomorrow Domai
More informationStatistical and Mathematical Methods DS-GA 1002 December 8, Sample Final Problems Solutions
Statistical ad Mathematical Methods DS-GA 00 December 8, 05. Short questios Sample Fial Problems Solutios a. Ax b has a solutio if b is i the rage of A. The dimesio of the rage of A is because A has liearly-idepedet
More informationProblem Set 4 Due Oct, 12
EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationEE 6885 Statistical Pattern Recognition
EE 6885 Statistical Patter Recogitio Fall 5 Prof. Shih-Fu Chag http://www.ee.columbia.edu/~sfchag Lecture 6 (9/8/5 EE6887-Chag 6- Readig EM for Missig Features Textboo, DHS 3.9 Bayesia Parameter Estimatio
More informationIntroduction to Machine Learning DIS10
CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig
More informationIntro to Learning Theory
Lecture 1, October 18, 2016 Itro to Learig Theory Ruth Urer 1 Machie Learig ad Learig Theory Comig soo 2 Formal Framework 21 Basic otios I our formal model for machie learig, the istaces to be classified
More informationTopics Machine learning: lecture 2. Review: the learning problem. Hypotheses and estimation. Estimation criterion cont d. Estimation criterion
.87 Machie learig: lecture Tommi S. Jaakkola MIT CSAIL tommi@csail.mit.edu Topics The learig problem hypothesis class, estimatio algorithm loss ad estimatio criterio samplig, empirical ad epected losses
More informationJanuary 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS
Jauary 25, 207 INTRODUCTION TO MATHEMATICAL STATISTICS Abstract. A basic itroductio to statistics assumig kowledge of probability theory.. Probability I a typical udergraduate problem i probability, we
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More information( θ. sup θ Θ f X (x θ) = L. sup Pr (Λ (X) < c) = α. x : Λ (x) = sup θ H 0. sup θ Θ f X (x θ) = ) < c. NH : θ 1 = θ 2 against AH : θ 1 θ 2
82 CHAPTER 4. MAXIMUM IKEIHOOD ESTIMATION Defiitio: et X be a radom sample with joit p.m/d.f. f X x θ. The geeralised likelihood ratio test g.l.r.t. of the NH : θ H 0 agaist the alterative AH : θ H 1,
More informationStep 1: Function Set. Otherwise, output C 2. Function set: Including all different w and b
Logistic Regressio Step : Fuctio Set We wat to fid P w,b C x σ z = + exp z If P w,b C x.5, output C Otherwise, output C 2 z P w,b C x = σ z z = w x + b = w i x i + b i z Fuctio set: f w,b x = P w,b C x
More informationStat 421-SP2012 Interval Estimation Section
Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible
More informationINF Introduction to classifiction Anne Solberg Based on Chapter 2 ( ) in Duda and Hart: Pattern Classification
INF 4300 90 Itroductio to classifictio Ae Solberg ae@ifiuioo Based o Chapter -6 i Duda ad Hart: atter Classificatio 90 INF 4300 Madator proect Mai task: classificatio You must implemet a classificatio
More informationBayesian Methods: Introduction to Multi-parameter Models
Bayesia Methods: Itroductio to Multi-parameter Models Parameter: θ = ( θ, θ) Give Likelihood p(y θ) ad prior p(θ ), the posterior p proportioal to p(y θ) x p(θ ) Margial posterior ( θ, θ y) is Iterested
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More informationFactor Analysis. Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis
Lecture 10: Factor Aalysis ad Pricipal Compoet Aalysis Sam Roweis February 9, 2004 Whe we assume that the subspace is liear ad that the uderlyig latet variable has a Gaussia distributio we get a model
More informationTAMS24: Notations and Formulas
TAMS4: Notatios ad Formulas Basic otatios ad defiitios X: radom variable stokastiska variabel Mea Vätevärde: µ = X = by Xiagfeg Yag kpx k, if X is discrete, xf Xxdx, if X is cotiuous Variace Varias: =
More informationStatistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.
Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized
More informationProbabilistic Unsupervised Learning
HT2015: SC4 Statistical Data Miig ad Machie Learig Dio Sejdiovic Departmet of Statistics Oxford http://www.stats.ox.ac.u/~sejdiov/sdmml.html Probabilistic Methods Algorithmic approach: Data Probabilistic
More informationLecture 7: Density Estimation: k-nearest Neighbor and Basis Approach
STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.
More informationSupport vector machine revisited
6.867 Machie learig, lecture 8 (Jaakkola) 1 Lecture topics: Support vector machie ad kerels Kerel optimizatio, selectio Support vector machie revisited Our task here is to first tur the support vector
More informationECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization
ECE 90 Lecture 4: Maximum Likelihood Estimatio ad Complexity Regularizatio R Nowak 5/7/009 Review : Maximum Likelihood Estimatio We have iid observatios draw from a ukow distributio Y i iid p θ, i,, where
More informationA quick activity - Central Limit Theorem and Proportions. Lecture 21: Testing Proportions. Results from the GSS. Statistics and the General Population
A quick activity - Cetral Limit Theorem ad Proportios Lecture 21: Testig Proportios Statistics 10 Coli Rudel Flip a coi 30 times this is goig to get loud! Record the umber of heads you obtaied ad calculate
More informationMachine Learning.
10-701 Machie Learig http://www.cs.cmu.edu/~epxig/class/10701-15f/ Orgaizatioal ifo All up-to-date ifo is o the course web page (follow liks from my page). Istructors - Eric Xig - Ziv Bar-Joseph TAs: See
More informationCSE 527, Additional notes on MLE & EM
CSE 57 Lecture Notes: MLE & EM CSE 57, Additioal otes o MLE & EM Based o earlier otes by C. Grat & M. Narasimha Itroductio Last lecture we bega a examiatio of model based clusterig. This lecture will be
More informationSTATS 200: Introduction to Statistical Inference. Lecture 1: Course introduction and polling
STATS 200: Itroductio to Statistical Iferece Lecture 1: Course itroductio ad pollig U.S. presidetial electio projectios by state (Source: fivethirtyeight.com, 25 September 2016) Pollig Let s try to uderstad
More informationAlgorithms for Clustering
CR2: Statistical Learig & Applicatios Algorithms for Clusterig Lecturer: J. Salmo Scribe: A. Alcolei Settig: give a data set X R p where is the umber of observatio ad p is the umber of features, we wat
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationThis exam contains 19 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.
Probability ad Statistics FS 07 Secod Sessio Exam 09.0.08 Time Limit: 80 Miutes Name: Studet ID: This exam cotais 9 pages (icludig this cover page) ad 0 questios. A Formulae sheet is provided with the
More informationChapter 12 EM algorithms The Expectation-Maximization (EM) algorithm is a maximum likelihood method for models that have hidden variables eg. Gaussian
Chapter 2 EM algorithms The Expectatio-Maximizatio (EM) algorithm is a maximum likelihood method for models that have hidde variables eg. Gaussia Mixture Models (GMMs), Liear Dyamic Systems (LDSs) ad Hidde
More informationMachine Learning. Logistic Regression -- generative verses discriminative classifier. Le Song /15-781, Spring 2008
Machie Learig 070/578 Srig 008 Logistic Regressio geerative verses discrimiative classifier Le Sog Lecture 5 Setember 4 0 Based o slides from Eric Xig CMU Readig: Cha. 3..34 CB Geerative vs. Discrimiative
More informationAsymptotics. Hypothesis Testing UMP. Asymptotic Tests and p-values
of the secod half Biostatistics 6 - Statistical Iferece Lecture 6 Fial Exam & Practice Problems for the Fial Hyu Mi Kag Apil 3rd, 3 Hyu Mi Kag Biostatistics 6 - Lecture 6 Apil 3rd, 3 / 3 Rao-Blackwell
More informationSolution of Final Exam : / Machine Learning
Solutio of Fial Exam : 10-701/15-781 Machie Learig Fall 2004 Dec. 12th 2004 Your Adrew ID i capital letters: Your full ame: There are 9 questios. Some of them are easy ad some are more difficult. So, if
More informationCS284A: Representations and Algorithms in Molecular Biology
CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by
Clustering: Mixture Models. Machine Learning 10-601B, Seyoung Kim. Many of these slides are derived from Tom Mitchell, Ziv Bar-Joseph, and Eric Xing. Thanks! Problem with K-means: hard assignment of samples into three ...
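The hard assignment that this excerpt flags as K-means' defining step looks like the following in one dimension. This is an illustrative sketch, not the lecture's code; the quantile-based initialization (which assumes k >= 2) and the toy data are arbitrary choices.

```python
def kmeans_1d(points, k, n_iter=20):
    # Lloyd's algorithm: hard-assign each point to its nearest centre,
    # then move each centre to the mean of the points assigned to it.
    pts = sorted(points)
    # Spread initial centres over the data range via quantiles (assumes k >= 2).
    centres = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in pts:
            j = min(range(k), key=lambda c: (p - centres[c]) ** 2)
            clusters[j].append(p)          # hard (all-or-nothing) assignment
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return centres

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 8.9, 9.0, 9.1]
centres = kmeans_1d(data, 3)
```

The mixture models of the title replace this hard assignment with soft responsibilities, which is exactly the E-step of EM.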
Machine Learning Theory (CS 6783). Lecture 2: Learning Frameworks, Examples. Setting up learning problems. X: instance space or input space. Examples: Computer Vision: raw M x N image vectorized, X = {0, ..., 255}^{M N}; SIFT ...
REGRESSION WITH QUADRATIC LOSS. MAXIM RAGINSKY. Regression with quadratic loss is another basic problem studied in statistical learning theory. We have a random couple Z = (X, Y), where, as before, X is an R^d ...
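Under quadratic loss, the best constant prediction for Y is its mean, which a brute-force search over candidate constants confirms. This is a toy numerical check, not code from these notes; the sample values are made up.

```python
# Under quadratic loss, the constant c minimizing E[(Y - c)^2] is c = E[Y].
ys = [1.0, 2.0, 2.0, 3.0, 7.0]

def risk(c):
    # Empirical quadratic risk of the constant predictor c.
    return sum((y - c) ** 2 for y in ys) / len(ys)

mean = sum(ys) / len(ys)
# Search a grid of candidates around the mean; the minimizer is the mean itself.
candidates = [mean + d / 10.0 for d in range(-20, 21)]
best = min(candidates, key=risk)
```

The same calculation with conditional distributions shows the regression function f(x) = E[Y | X = x] minimizes the quadratic risk.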
Axis-Aligned Ellipsoid. Machine Learning for Data Science (CS 4786), Lectures 6, 7 & 8: Ellipsoidal Clustering, Gaussian Mixture Models and General Mixture Models. The text in black outlines high-level ideas. The text in blue provides simple ...
Outline. L7: Probability Basics. CS 344R/393R: Robotics, Benjamin Kuipers. 1. Bayes Law. 2. Probability distributions. 3. Decisions under uncertainty. Bayes Law for Diagnosis: Which hypothesis to prefer? p(A,B) = p(B|A) p(A). Probability: for a proposition A, the probability p(A) is your degree ...
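The diagnosis use of Bayes' law, p(A|B) = p(B|A) p(A) / p(B), is easy to compute directly. The numbers below are illustrative, not from the lecture.

```python
def posterior(prior, sensitivity, false_positive_rate):
    # Bayes' law: P(disease | positive) = P(pos | disease) P(disease) / P(pos),
    # with P(pos) expanded by the law of total probability.
    p_pos = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_pos

# A rare condition (1% prior) with a fairly accurate test still yields a
# surprisingly low posterior, because false positives dominate.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
```

Here p is about 0.16: even a positive result from a 95%-sensitive test leaves the hypothesis "disease" less likely than "no disease".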
EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY. GRADUATE DIPLOMA, 2016. MODULE: Statistical Inference. Time allowed: three hours. Candidates should answer FIVE questions. All questions carry equal marks. The number ...
Lecture 15: Learning Theory: Concentration Inequalities. STAT 425: Introduction to Nonparametric Statistics, Winter 2018. Instructor: Yen-Chi Chen. 15.1 Introduction. Recall that in the lecture on classification, we have seen that ...
Agnostic Learning and Concentration Inequalities. ECE901 Spring 2004: Statistical Regularization and Learning Theory, Lecture 7. Lecturer: Rob Nowak. Scribe: Aravind Kailas. 1 Introduction. 1.1 Motivation. In the last lecture ...
STATISTICAL INFERENCE: INTRODUCTION. Statistical inference is that branch of Statistics in which one typically makes a statement about a population based upon the results of a sample. If, for instance, we were required to test whether the population mean μ could be equal to a certain value μ ... In one-sample testing, we essentially ...
Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen). MATH-252-01: Probability and Statistics II, Spring 2019. Contents: 1 Chi-Squared Tests with Known Probabilities; 1.1 Chi-Squared Testing ...
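A chi-squared statistic with known probabilities, X^2 = sum_i (O_i - E_i)^2 / E_i with E_i = n p_i, is a one-liner. This is a sketch in the spirit of the chapter, not its code; the die-roll counts are made up.

```python
def chi_squared_statistic(observed, probs):
    # X^2 = sum (O_i - E_i)^2 / E_i with expected counts E_i = n * p_i.
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, probs))

# 60 rolls of a (hypothetically fair) six-sided die.
obs = [12, 8, 10, 9, 11, 10]
stat = chi_squared_statistic(obs, [1 / 6] * 6)
```

The statistic is then compared to a chi-squared distribution with (number of categories - 1) = 5 degrees of freedom; a value near 1 here is entirely consistent with a fair die.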
Properties and Tests of Zeros of Polynomial Functions. The Remainder and Factor Theorems: synthetic division can be used to find the values of polynomials in a sometimes easier way than substitution. This is shown by ...
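Synthetic division, as the excerpt notes, divides and evaluates at once: by the Remainder Theorem the final carried value equals p(r). A minimal sketch (not from the handout; the example polynomial is an arbitrary choice):

```python
def synthetic_division(coeffs, r):
    # Divide p(x), given by coefficients in descending order, by (x - r).
    # Each step: bring down, multiply by r, add to the next coefficient.
    # Returns (quotient coefficients, remainder); the remainder equals p(r).
    out = [coeffs[0]]
    for c in coeffs[1:]:
        out.append(c + r * out[-1])
    return out[:-1], out[-1]

# p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3)
q, rem = synthetic_division([1, -6, 11, -6], 2)
```

Since the remainder is 0, the Factor Theorem says (x - 2) is a factor, and the quotient coefficients give x^2 - 4x + 3.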
Class 23. Daniel B. Rowe, Ph.D., Department of Mathematics, Statistics, and Computer Science, Marquette University, MATH 1700. Copyright 2017 by D.B. Rowe. Agenda: Recap Chapter 9.1. Lecture Chapter 9.2. Review Exam 6 Problem Solving Session.
6.867 Machine learning, lecture 7 (Jaakkola). Lecture topics: kernel form of linear regression; kernels, examples, construction, properties. Linear regression and kernels: consider a slightly simpler model where we omit ...
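The kernel form of linear regression mentioned here predicts with f(x) = sum_i alpha_i k(x_i, x), where alpha solves (K + lambda I) alpha = y. The sketch below is self-contained and illustrative, not Jaakkola's code; the polynomial kernel, toy data, and regularizer lambda are arbitrary choices.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting, for small dense systems.
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            M[r] = [a - f * c for a, c in zip(M[r], M[i])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def kernel_ridge_fit(xs, ys, kernel, lam=1e-6):
    # alpha = (K + lam*I)^{-1} y; prediction f(x) = sum_i alpha_i k(x_i, x).
    K = [[kernel(a, b) + (lam if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    return solve(K, ys)

k = lambda a, b: (1.0 + a * b) ** 2       # degree-2 polynomial kernel
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]                  # target function: f(x) = x^2
alpha = kernel_ridge_fit(xs, ys, k)
pred = sum(a * k(x, 1.5) for a, x in zip(alpha, xs))
```

Because x^2 lies in the feature space of the degree-2 polynomial kernel, the prediction at x = 1.5 recovers 2.25 up to the tiny regularization.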
Lecture 3: MLE and Regression. STAT/Q SCI 403: Introduction to Resampling Methods, Spring 2017. Instructor: Yen-Chi Chen. 3.1 Parameters and Distributions. Some distributions are indexed by their underlying parameters. Thus, ...
ENGI 4421 Confidence Intervals (Two Samples), Page 12-01. Two-Sample Confidence Interval for a Difference in Population Means [Navidi sections 5.4-5.7; Devore chapter 9]. From the central limit theorem, we know that, for sufficiently ...
Statistical Inference Based on Extremum Estimators. T. Rothenberg, Fall 2007. Introduction: suppose θ_0, the true value of a p-dimensional parameter, is known to lie in some subset S ⊆ R^p. Often we choose to estimate θ_0 ...
Lecture 3. Properties of Summary Statistics: Sampling Distribution. Main Theme: How can we use math to justify that our numerical summaries from the sample are good summaries of the population? Lecture Summary ...
6.3 Testing Series With Positive Terms. 6.3.1 Review of what is known up to now. In theory, testing a series Σ_{i=1}^∞ a_i for convergence amounts to finding the sequence of partial ...
Lecture 6: Chi-Square Distribution (χ²) and Least Squares Fitting. Chi-Square Distribution (χ²). Suppose we have a set of measurements {x_1, x_2, ..., x_n}. We know the true value of each x_i (x_{t1}, x_{t2}, ..., x_{tn}). We would ...
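Least-squares fitting of a straight line with per-point measurement errors σ_i minimizes χ² = Σ ((y_i − a − b x_i)/σ_i)², and has a closed form. This is a sketch in the lecture's spirit, not its code; the data points and errors are made up.

```python
def fit_line_chi2(xs, ys, sigmas):
    # Minimize chi^2 = sum ((y_i - (a + b x_i)) / sigma_i)^2 over a and b.
    # Standard weighted-least-squares normal equations for a straight line.
    w = [1.0 / s ** 2 for s in sigmas]
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    delta = S * Sxx - Sx ** 2
    a = (Sxx * Sy - Sx * Sxy) / delta      # intercept
    b = (S * Sxy - Sx * Sy) / delta        # slope
    chi2 = sum(wi * (y - a - b * x) ** 2 for wi, x, y in zip(w, xs, ys))
    return a, b, chi2

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                  # exactly y = 1 + 2x
a, b, chi2 = fit_line_chi2(xs, ys, [0.5] * 4)
```

For real (noisy) data the minimized χ² would be compared to a chi-square distribution with n − 2 degrees of freedom to judge the fit.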
Chapter 6: Sampling Distributions. In most experiments, we have more than one measurement for any given variable, each measurement being associated with one randomly selected member of a population. Hence we need to ...
Lecture 5. Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f. f(x) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise. The likelihood function φ(θ) = ∏_{i=1}^n f(X_i) = θ^{-n} I(X_1, ..., X_n ∈ [0, θ]) ...
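For U[0, θ] the likelihood θ^{-n} I(max_i X_i ≤ θ) is decreasing in θ wherever the indicator allows it to be positive, so the MLE is the sample maximum. A quick numerical check (illustrative, not from the lecture; the true θ and sample size are arbitrary):

```python
import random

random.seed(2)
theta_true = 3.0
sample = [random.uniform(0.0, theta_true) for _ in range(1000)]

# theta^{-n} decreases in theta, and the indicator forces theta >= max(sample),
# so the likelihood is maximized exactly at the sample maximum.
theta_hat = max(sample)
```

Note the estimator is biased low (it can never exceed θ), but the bias vanishes at rate 1/n.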
Simulation. Two Rules for Inverting a Distribution Function. Rule 1: If F(x) = u is constant on an interval [x_1, x_2), then the uniform value u is mapped onto x_2 through the inversion process. Rule 2: If there is a jump ...
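Away from the flat regions and jumps that the two rules handle, inversion is direct: for a continuous, strictly increasing F, X = F^{-1}(U) with U uniform has distribution F. A sketch using the exponential distribution (an illustrative choice, not the text's example):

```python
import math
import random

def sample_exponential(rate, rng):
    # Inverse-transform sampling: F(x) = 1 - exp(-rate*x) is strictly
    # increasing, so F^{-1}(u) = -ln(1 - u) / rate is exponential(rate).
    u = rng.random()
    return -math.log(1.0 - u) / rate

rng = random.Random(3)
draws = [sample_exponential(2.0, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)   # should approach 1/rate = 0.5
```

Rules 1 and 2 extend this to the generalized inverse F^{-1}(u) = inf{x : F(x) >= u}, which covers distributions with atoms or flat stretches.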
Tests of Hypotheses Based on a Single Sample (Devore Chapter Eight). MATH-252-01: Probability and Statistics II, Spring 2018. Contents: 1 Hypothesis Tests illustrated with z-tests; 1.1 Overview of Hypothesis Testing ...
Economics 241B: Relation to Method of Moments and Maximum Likelihood. OLSE as a Maximum Likelihood Estimator. Under Assumption 5 we have specified the distribution of the error, so we can estimate the model parameters ...
The Expectation-Maximization (EM) Algorithm. Reading Assignments: T. Mitchell, Machine Learning, McGraw-Hill, 1997 (section 6.2, hard copy); S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College ...
Final Examination Solutions, 17/6/2010. The Islamic University of Gaza, Faculty of Commerce, Department of Economics and Political Sciences. An Introduction to Statistics Course (ECOE 30), Spring Semester 2009-2010. Name: ID: Instructor: ...
1 Inferential Methods for Correlation and Regression Analysis. In the chapter on Correlation and Regression Analysis, tools for describing bivariate continuous data were introduced. The sample Pearson Correlation Coefficient ...
10-701 Machine Learning, Spring 2011: Homework 1 Solution. February 1, 2011. Instructions: There are 3 questions on this assignment. The last question involves coding. Attach your code to the writeup. Please submit your ...
Estimation for Complete Data. Complete data: there is no loss of information during the study. Complete individual data vs. grouped data: a complete individual data set is one in which the complete information of ...
Lecture 9: Boosting. Akshay Krishnamurthy (akshay@cs.umass.edu), October 3, 2017. Recap: last week we discussed some algorithmic aspects of machine learning. We saw one very powerful family of learning algorithms, namely ...
Recall: STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. Comments: So far we have estimates of the parameters β_0 and β_1, but have no idea how good these estimates are. Assumption: E(Y|x) = β_0 + β_1 x (linear conditional ...
Hypothesis testing. PSYCHOLOGICAL RESEARCH (PYC 304-C), Lecture 9. Statistical inference is that branch of Statistics in which one typically makes a statement about a population based upon the results of a sample. In ...
Empirical Process Theory and Oracle Inequalities. Stat 928: Statistical Learning Theory, Lecture 10. Instructor: Sham Kakade. 1 Risk vs Risk: see Lecture 0 for a discussion on terminology. 2 The Union Bound / Bonferroni ...
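The union bound is typically combined with Hoeffding's inequality, P(|p̂ − p| ≥ ε) ≤ 2 exp(−2nε²) for [0,1]-valued variables, and a simulation shows the bound holds with plenty of room. This is an illustrative check, not code from the Stat 928 notes; n, ε, and the trial count are arbitrary.

```python
import math
import random

def hoeffding_bound(n, eps):
    # Hoeffding: P(|empirical mean - p| >= eps) <= 2 * exp(-2 * n * eps^2)
    # for i.i.d. random variables bounded in [0, 1].
    return 2.0 * math.exp(-2.0 * n * eps * eps)

rng = random.Random(4)
n, p, eps = 500, 0.5, 0.1
trials = 2000
bad = 0
for _ in range(trials):
    # Empirical mean of n Bernoulli(p) draws.
    mean = sum(rng.random() < p for _ in range(n)) / n
    if abs(mean - p) >= eps:
        bad += 1
empirical = bad / trials               # observed deviation frequency
bound = hoeffding_bound(n, eps)        # 2 * exp(-10), about 9.1e-5
```

A union bound over m such events (e.g. m hypotheses in a finite class) just multiplies the right-hand side by m, which is the step behind the oracle inequalities in the title.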