Machine Learning.
- Alice Stafford
- 5 years ago
1 Machine Learning
2 Organizational info: All up-to-date info is on the course web page (follow links from my page). Instructors - Eric Xing - Ziv Bar-Joseph. TAs: see the website for contact info, office hours, recitations, etc. Piazza will be used for questions / comments; make sure you are subscribed.
3 Zhiting Hu. Research: large-scale machine learning and its applications in NLP/CV. Homepage: Contact: zhitinghu@gmail.com
4
5 Yutian Deng. Research: large-scale machine learning. Contact:
6
7 Hao Zhang. Find me: GHC 8116. Office Hours: Friday 3:30 pm - 4:30 pm. Interests: Distributed Machine Learning, Deep Learning, Applications in computer vision
8
9 Schedule:
9/3 Intro to probability, MLE
9/8 No class
9/10 Classification, KNN
9/15 No class, Jewish new year
9/17 Decision trees - PS1 out
9/22 Naïve Bayes
9/24 Linear regression
9/26 Logistic regression
10/1 Perceptron, Neural networks - PS1 due / PS2 out
10/6 Deep learning, SVM 1
10/10 SVM 2
10/13 Evaluating classifiers, Bias-Variance decomposition
10/15 Ensemble learning: Boosting, RF - PS2 due / PS3 out
10/20 Unsupervised learning: clustering
10/22 Unsupervised learning: clustering / project proposal due
10/27 Semi-supervised learning
10/29 Learning theory 1 - PS3 due / PS4 out
11/3 PAC learning
11/5 Graphical models, BN
11/10 BN
11/12 Undirected graphical models / PS4 due
11/17 (Monday): Midterm
11/19 HMM - PS5 out
11/24 HMM inference
12/1 MDPs / Reinforcement learning / PS5 due
12/3 Topic models
12/4 Project poster session
12/8 Computational Biology
12/10 No class
Topics covered: Intro and classification (a.k.a. supervised learning), Clustering (unsupervised learning), Probabilistic representation and modeling (reasoning under uncertainty), Applications of ML
10 Grading: 5 Problem sets - 40%, Project - 35%, Midterm - 25%
11 Class assignments: 5 Problem sets - most containing both theoretical and programming assignments. Projects - done in groups, open ended; you will have to submit a proposal based on your interests, and we will also provide suggestions on the website. Recitations - twice a week (same content in both) - expand on material learned in class, go over problems from previous classes, etc.
12 What is Machine Learning? Easy part: Machine. Hard part: Learning. Short answer: methods that can help generalize information from the observed data so that it can be used to make better decisions in the future.
13 What is Machine Learning? Longer answer: the term Machine Learning is used to characterize a number of different approaches for generalizing from observed data: Supervised learning - given a set of features and labels, learn a model that will predict a label for a new feature set. Unsupervised learning - discover patterns in data. Reasoning under uncertainty - determine a model of the world, either from samples or as you go along. Active learning - select not only the model but also which examples to use.
14 Paradigms of ML. Supervised learning - given D = {X_i, Y_i}, learn a model (or function) F: X_k -> Y_k. Unsupervised learning - given D = {X_i}, group the data into Y classes using a model (or function) F: X_i -> Y_j. Reinforcement learning (reasoning under uncertainty) - given D = {environment, actions, rewards}, learn a policy and utility functions: policy F1: {e,r} -> a; utility F2: {a,e} -> R. Active learning - given D = {X_i, Y_i}, {X_j}, learn a function F1: {X_j} -> x_k to maximize the success of the supervised learning function F2: {X_i, x_k} -> Y.
15 Recommender systems: primarily supervised learning
16 Semi-supervised learning
17 Driverless cars: supervised and reinforcement learning
18 Helicopter control: reinforcement learning
19 Biology: Which part is the gene? Supervised and unsupervised learning (can also use active learning). [slide background: a long DNA sequence of A/C/G/T characters]
20 Common Themes. Mathematical framework - well-defined concepts based on explicit assumptions. Representation - how do we encode text? Images? Model selection - which model should we use? How complex should it be? Use of prior knowledge - how do we encode our beliefs? How much can we assume?
21 (brief) intro to probability
22 Basic notations. Random variable - refers to an element / event whose status is unknown: A = "it will rain tomorrow". Domain (usually denoted by Ω) - the set of values a random variable can take: - A = "The stock market will go up this year": binary - A = "Number of Steelers wins in 2012": discrete - A = "% change in Google stock in 2012": continuous
23 Axioms of probability (Kolmogorov's axioms). A variety of useful facts can be derived from just three axioms: 1. 0 <= P(A) <= 1. 2. P(true) = 1, P(false) = 0. 3. P(A or B) = P(A) + P(B) - P(A and B). There have been several other attempts to provide a foundation for probability theory; Kolmogorov's axioms are the most widely used.
24 Priors. Degree of belief in an event in the absence of any other information: P(rain tomorrow) = 0.2, P(no rain tomorrow) = 0.8.
25 Conditional probability. P(A = 1 | B = 1): the fraction of cases where A is true given that B is true. Example: P(A) = 0.2, P(A | B) = 0.5.
26 Conditional probability. In some cases, given knowledge of one or more random variables, we can improve upon our prior belief of another random variable. For example: p(slept in movie) = 0.5, p(slept in movie | liked movie) = 1/4, p(didn't sleep in movie | liked movie) = 3/4.
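The "fraction of cases" view of conditional probability is easy to check with a small sketch. The dataset below is hypothetical (not from the slides), constructed so the counts match the slide's numbers: half the viewers slept, and only 1 in 4 of those who liked the movie slept.

```python
# Hypothetical audience of 8 viewers, chosen to match the slide's numbers.
viewers = [
    {"liked": True,  "slept": True},
    {"liked": True,  "slept": False},
    {"liked": True,  "slept": False},
    {"liked": True,  "slept": False},
    {"liked": False, "slept": True},
    {"liked": False, "slept": True},
    {"liked": False, "slept": True},
    {"liked": False, "slept": False},
]

def prob(pred, data):
    """P(pred): fraction of records satisfying pred."""
    return sum(1 for r in data if pred(r)) / len(data)

def cond_prob(pred, given, data):
    """P(pred | given): restrict to records where `given` holds, then count."""
    subset = [r for r in data if given(r)]
    return prob(pred, subset)

p_slept = prob(lambda r: r["slept"], viewers)                   # 0.5
p_slept_given_liked = cond_prob(lambda r: r["slept"],
                                lambda r: r["liked"], viewers)  # 1/4
```

Conditioning is just counting inside the sub-population where the conditioning event holds.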
27 Joint distributions. The probability that a set of random variables will take a specific value is their joint distribution. Notation: P(A and B) or P(A,B). Example: P(liked movie, slept). If we assume independence then P(A,B) = P(A)P(B); however, in many cases such an assumption may be too strong (more later in the class).
28 Joint distribution (cont.) P(class size > 20) = 0.6, P(summer) = 0.4, P(class size > 20, summer) = ? Evaluation of classes:
Size Time Eval
30 R 2
70 R 1
12 S 2
8 S 3
56 R 1
24 S 2
10 S 3
23 R 3
9 R 2
45 R 1
29 Joint distribution (cont.) Using the table above: P(class size > 20) = 0.6, P(summer) = 0.4, P(class size > 20, summer) = 0.1
30 Joint distribution (cont.) P(class size > 20) = 0.6, P(eval = 1) = 0.3, P(class size > 20, eval = 1) = 0.3
31 Joint distribution (cont.) Same values as slide 30, shown with the table.
32 Chain rule. The joint distribution can be specified in terms of conditional probability: P(A,B) = P(A|B) * P(B). Together with Bayes rule (which is actually derived from it), this is one of the most powerful rules in probabilistic reasoning.
33 Bayes rule. One of the most important rules for this class. Derived from the chain rule: P(A,B) = P(A|B)P(B) = P(B|A)P(A). Thus, P(A|B) = P(B|A)P(A) / P(B). Thomas Bayes was an English clergyman who set out his theory of probability in 1764.
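As a one-line sketch of the rule: given the likelihood P(B|A), the prior P(A), and the evidence P(B), the posterior follows directly. The rain prior (0.2) is from the priors slide; the other two numbers are hypothetical, introduced only for illustration.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Posterior P(A|B) via Bayes rule: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: prior P(rain) = 0.2 (as on the priors slide),
# likelihood P(clouds | rain) = 0.9, evidence P(clouds) = 0.4.
posterior = bayes(0.9, 0.2, 0.4)   # P(rain | clouds) = 0.45
```

Observing clouds raises the belief in rain from the prior 0.2 to a posterior of 0.45.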
34 Bayes rule (cot) Ofte it would be useful to derive the rule a bit further: A A P A B P A P A B P B P A P A B P A B P ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( This results from: P(B) = A P(B,A) A B A B P(B,A=1) P(B,A=0)
35 Density estimation
36 Density Estimation. A density estimator learns a mapping from a set of attributes to a probability: input data for a variable (or a set of variables) -> Density Estimator -> probability.
37 Density estimation. Estimate the distribution (or conditional distribution) of a random variable. Types of variables: - Binary: coin flip, alarm - Discrete: dice, car model year - Continuous: height, weight, temperature, ...
38 When do we need to estimate densities? Density estimators can do many good things: - can sort the records by probability, and thus spot weird records (anomaly detection) - can do inference, P(E1|E2): medical diagnosis / robot sensors - are an ingredient for Bayes networks and other types of ML methods.
39 Density estimation. Binary and discrete variables: easy - just count! Continuous variables: harder (but just a bit) - fit a model.
40 Learning a density estimator for discrete variables. P̂(x = u) = (# records in which x = u) / (total number of records). A trivial learning algorithm! But why is this true?
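The "just count" estimator is a few lines of code. A minimal sketch (the die rolls are hypothetical sample data):

```python
from collections import Counter

def fit_discrete(samples):
    """Count-based density estimator for a discrete variable:
    P_hat(x = u) = (# records with x = u) / (total # records)."""
    counts = Counter(samples)
    n = len(samples)
    return {value: c / n for value, c in counts.items()}

rolls = [1, 2, 2, 3, 6, 6, 6, 4]   # hypothetical die rolls
p_hat = fit_discrete(rolls)        # e.g. p_hat[6] = 3/8, p_hat[2] = 2/8
```

The estimated probabilities are just normalized counts, so they always sum to 1; the slides that follow explain why this is also the maximum-likelihood answer.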
41 Maximum Likelihood Principle. We can define the likelihood of the data given the model as follows: P̂(dataset | M) = P̂(x_1 and x_2 and ... and x_n | M) = Π_{k=1}^{n} P̂(x_k | M). For example, M is: - the probability of heads for a coin flip - the probabilities of observing 1, 2, 3, 4 and 5 for a die - etc. M is our model (usually a collection of parameters).
42 Maximum Likelihood Principle. Our goal is to determine the values for the parameters in M. We can do this by maximizing the probability of generating the observed samples: P̂(dataset | M) = Π_{k=1}^{n} P̂(x_k | M). For example, for a coin flip, L(x_1,...,x_n) = p(x_1) ... p(x_n), where the observations (different flips) are assumed to be independent. For such a coin flip with P(H) = q, the best assignment is argmax q = #H/#samples. Why?
43 Maximum Likelihood Principle: binary variables. For a binary random variable A with P(A=1) = q, argmax q = #1/#samples. Why? Data likelihood: P(D | M) = q^{n_1} (1-q)^{n_2}, where n_1 and n_2 count the 1s and 0s in the data. We would like to find: argmax_q q^{n_1} (1-q)^{n_2} (omitting terms that do not depend on q).
44 Maximum Likelihood Principle. Data likelihood: P(D | M) = q^{n_1} (1-q)^{n_2}. We would like to find: argmax_q q^{n_1} (1-q)^{n_2}. Setting the derivative to zero: d/dq [ q^{n_1} (1-q)^{n_2} ] = n_1 q^{n_1 - 1} (1-q)^{n_2} - n_2 q^{n_1} (1-q)^{n_2 - 1} = 0. Dividing by q^{n_1 - 1} (1-q)^{n_2 - 1} gives n_1 (1-q) = n_2 q, and so q = n_1 / (n_1 + n_2).
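The closed-form answer q = n_1 / (n_1 + n_2) can be sanity-checked numerically: evaluate the likelihood on a grid of q values and take the best one. The sample counts below are hypothetical.

```python
def likelihood(q, n1, n0):
    """P(data | q) for n1 ones and n0 zeros (independent Bernoulli draws)."""
    return q ** n1 * (1 - q) ** n0

n1, n0 = 7, 3                      # hypothetical sample: seven 1s, three 0s
grid = [i / 1000 for i in range(1001)]
q_best = max(grid, key=lambda q: likelihood(q, n1, n0))
# q_best = 0.7 = n1 / (n1 + n0), matching the derivative-based derivation
```

The grid search lands exactly on the analytical maximizer #1/#samples.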
45 Log Probabilities. When working with products, probabilities of entire datasets often get too small. A possible solution is to use the log of probabilities, often termed the log-likelihood: log P̂(dataset | M) = log Π_k P̂(x_k | M) = Σ_k log P̂(x_k | M). Maximizing this function is the same as maximizing P(dataset | M). Logs of values between 0 and 1 are negative. In some cases moving to log space also makes computation easier (for example, removing the exponents).
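The "too small" problem is real floating-point underflow, not just inconvenience. A minimal demonstration: the product of 2000 factors of 0.5 underflows to exactly 0.0 in double precision, while the equivalent sum of logs is perfectly representable.

```python
import math

# 2000 fair coin flips, each with probability 0.5.
probs = [0.5] * 2000

product = 1.0
for p in probs:
    product *= p
# product underflows to 0.0 (the true value, 2^-2000, is far below
# the smallest representable positive double)

log_likelihood = sum(math.log(p) for p in probs)   # = 2000 * log(0.5)
```

In log space the dataset likelihood is an ordinary sum of moderate negative numbers, so comparing models remains numerically safe.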
46 Density estimation. Binary and discrete variables: easy - just count! Continuous variables: harder (but just a bit) - fit a model. But what if we only have very few samples?
47 How much do grad students sleep? Let's try to estimate the distribution of the time students spend sleeping (outside class).
48 Possible statistics. X = sleep time. Mean of X: E{X} = 7.03. Variance of X: Var{X} = E{(X-E{X})^2}. [histogram: frequency vs. sleep hours]
49 Covariance: Sleep vs. GPA. Covariance of X1, X2: Cov{X1,X2} = E{(X1-E{X1})(X2-E{X2})}. [scatter plot: sleep hours vs. GPA]
50 Statistical Models. Statistical models attempt to characterize properties of the population of interest. For example, we might believe that repeated measurements follow a normal (Gaussian) distribution with some mean µ and variance σ^2, x ~ N(µ, σ^2), where p(x) = (1 / sqrt(2π σ^2)) exp( -(x-µ)^2 / (2σ^2) ) and θ = (µ, σ^2) defines the parameters (mean and variance) of the model.
51 The Parameters of Our Model. A statistical model is a collection of distributions; the parameters specify individual distributions: x ~ N(µ, σ^2). We need to adjust the parameters so that the resulting distribution fits the data well.
53 Computing the parameters of our model. Let's assume a Gaussian distribution for our sleep data. How do we compute the parameters of the model? [histogram of sleep hours]
54 Maximum Likelihood Principle. We can fit statistical models by maximizing the probability of generating the observed samples: L(x_1,...,x_n) = p(x_1) ... p(x_n) (the samples are assumed to be independent). In the Gaussian case we simply set the mean and the variance to the sample mean and the sample variance: µ̂ = (1/n) Σ_{i=1}^{n} x_i, σ̂^2 = (1/n) Σ_{i=1}^{n} (x_i - µ̂)^2. Why?
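The Gaussian ML estimates are just two passes over the data. A minimal sketch (the sleep-hours sample is hypothetical, standing in for the survey data on the slides):

```python
def gaussian_mle(xs):
    """ML estimates for a Gaussian: sample mean and sample variance.
    Note the variance divides by n (the MLE), not n-1 (the unbiased estimate)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

sleep_hours = [6, 7, 8, 7, 7, 9, 5, 7]   # hypothetical sleep survey
mu, var = gaussian_mle(sleep_hours)      # mu = 7.0, var = 1.25
```

Maximizing the (log-)likelihood of a Gaussian in µ and σ^2 yields exactly these plug-in formulas, which is why fitting this model reduces to computing two summary statistics.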
55 Important points. Random variables. Chain rule. Bayes rule. Joint distribution, independence, conditional independence. MLE.
56 Probability Density Function. Discrete distributions. Continuous: Cumulative Distribution Function (CDF): F(a) = P(x <= a) = ∫_{-∞}^{a} f(x) dx.
57 Total probability. Cumulative Distribution Functions and the Probability Density Function (PDF). Properties: F(x) is non-decreasing, F(-∞) = 0, F(∞) = 1, and f(x) = dF(x)/dx.
58 Expectations. Mean/Expected Value: E{X} = ∫ x f(x) dx. Variance: Var{X} = E{(X-E{X})^2}. In general: E{g(X)} = ∫ g(x) f(x) dx.
59 Multivariate. Joint for (x,y): p(x,y). Marginal: p(x) = ∫ p(x,y) dy. Conditionals: p(x|y) = p(x,y) / p(y). Chain rule: p(x,y) = p(x|y) p(y).
60 Bayes Rule. Standard form: P(A|B) = P(B|A)P(A) / P(B). Replacing the bottom: P(A|B) = P(B|A)P(A) / Σ_A P(B|A)P(A).
61 Binomial. Distribution: P(X = k) = C(n,k) p^k (1-p)^{n-k}. Mean/Var: E{X} = np, Var{X} = np(1-p).
62 Uniform. Anything is equally likely in the region [a,b]. Distribution: f(x) = 1/(b-a) for x in [a,b], 0 otherwise. Mean/Var: E{X} = (a+b)/2, Var{X} = (b-a)^2/12.
63 Gaussian (Normal). If I look at the height of women in country XX, it will look approximately Gaussian. Small random noise errors look Gaussian/Normal. Distribution: p(x) = (1 / sqrt(2π σ^2)) exp( -(x-µ)^2 / (2σ^2) ). Mean/var: µ, σ^2.
64 Why Do People Use Gaussians? Central Limit Theorem (loosely): the sum of a large number of IID random variables is approximately Gaussian.
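The CLT is easy to see empirically. The sketch below (illustrative, not from the slides) sums 50 IID Uniform(0,1) draws; the sums cluster around a mean of 50 * 0.5 = 25 with variance 50 * (1/12), just as a Gaussian approximation predicts.

```python
import random
import statistics

random.seed(0)  # fixed seed so the experiment is repeatable

# Each sample is a sum of 50 IID Uniform(0,1) draws.
sums = [sum(random.random() for _ in range(50)) for _ in range(10_000)]

mean = statistics.mean(sums)    # close to 25 = 50 * 0.5
stdev = statistics.stdev(sums)  # close to sqrt(50 / 12), about 2.04
```

A histogram of `sums` would show the familiar bell shape, even though each individual draw is uniform, not Gaussian.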
65 Multivariate Gaussians. Distribution for a vector x. PDF: p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x-µ)^T Σ^{-1} (x-µ) ).
66 Multivariate Gaussians. Sample covariance: cov(x_1, x_2) = (1/n) Σ_{i=1}^{n} (x_{1,i} - µ_1)(x_{2,i} - µ_2).
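The sample covariance formula translates directly into code. A minimal sketch with hypothetical data; note that the covariance of a variable with itself is just its variance.

```python
def covariance(xs, ys):
    """Sample covariance: (1/n) * sum((x - mean_x) * (y - mean_y))."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
cov_self = covariance(xs, xs)             # = variance of xs = 1.25
cov_anti = covariance(xs, xs[::-1])       # reversed pairing: -1.25
```

Positive values indicate the variables move together, negative values that they move in opposite directions, matching the covariance-examples slide.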
67 Covariance examples. [three scatter plots] Anti-correlated: covariance -9.2. Correlated: covariance (value not shown). Independent (almost): covariance 0.6.
68 Sum of Gaussians. The sum of two (independent) Gaussians is a Gaussian.
Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,
More informationOutline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression
REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques
More informationEstimation for Complete Data
Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of
More informationEconomics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator
Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters
More informationSTATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. Comments:
Recall: STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS Commets:. So far we have estimates of the parameters! 0 ad!, but have o idea how good these estimates are. Assumptio: E(Y x)! 0 +! x (liear coditioal
More informationStat 400: Georgios Fellouris Homework 5 Due: Friday 24 th, 2017
Stat 400: Georgios Fellouris Homework 5 Due: Friday 4 th, 017 1. A exam has multiple choice questios ad each of them has 4 possible aswers, oly oe of which is correct. A studet will aswer all questios
More informationLecture 1 Probability and Statistics
Wikipedia: Lecture 1 Probability ad Statistics Bejami Disraeli, British statesma ad literary figure (1804 1881): There are three kids of lies: lies, damed lies, ad statistics. popularized i US by Mark
More informationProblem Set 4 Due Oct, 12
EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios
More informationChapter 8: STATISTICAL INTERVALS FOR A SINGLE SAMPLE. Part 3: Summary of CI for µ Confidence Interval for a Population Proportion p
Chapter 8: STATISTICAL INTERVALS FOR A SINGLE SAMPLE Part 3: Summary of CI for µ Cofidece Iterval for a Populatio Proportio p Sectio 8-4 Summary for creatig a 100(1-α)% CI for µ: Whe σ 2 is kow ad paret
More information7-1. Chapter 4. Part I. Sampling Distributions and Confidence Intervals
7-1 Chapter 4 Part I. Samplig Distributios ad Cofidece Itervals 1 7- Sectio 1. Samplig Distributio 7-3 Usig Statistics Statistical Iferece: Predict ad forecast values of populatio parameters... Test hypotheses
More informationUnderstanding Samples
1 Will Moroe CS 109 Samplig ad Bootstrappig Lecture Notes #17 August 2, 2017 Based o a hadout by Chris Piech I this chapter we are goig to talk about statistics calculated o samples from a populatio. We
More informationLinear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d
Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y
More informationPRACTICE PROBLEMS FOR THE FINAL
PRACTICE PROBLEMS FOR THE FINAL Math 36Q Sprig 25 Professor Hoh Below is a list of practice questios for the Fial Exam. I would suggest also goig over the practice problems ad exams for Exam ad Exam 2
More informationNANYANG TECHNOLOGICAL UNIVERSITY SYLLABUS FOR ENTRANCE EXAMINATION FOR INTERNATIONAL STUDENTS AO-LEVEL MATHEMATICS
NANYANG TECHNOLOGICAL UNIVERSITY SYLLABUS FOR ENTRANCE EXAMINATION FOR INTERNATIONAL STUDENTS AO-LEVEL MATHEMATICS STRUCTURE OF EXAMINATION PAPER. There will be oe 2-hour paper cosistig of 4 questios.
More informationThe Maximum-Likelihood Decoding Performance of Error-Correcting Codes
The Maximum-Lielihood Decodig Performace of Error-Correctig Codes Hery D. Pfister ECE Departmet Texas A&M Uiversity August 27th, 2007 (rev. 0) November 2st, 203 (rev. ) Performace of Codes. Notatio X,
More informationResampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.
Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator
More informationHOMEWORK I: PREREQUISITES FROM MATH 727
HOMEWORK I: PREREQUISITES FROM MATH 727 Questio. Let X, X 2,... be idepedet expoetial radom variables with mea µ. (a) Show that for Z +, we have EX µ!. (b) Show that almost surely, X + + X (c) Fid the
More informationWhat is Probability?
Quatificatio of ucertaity. What is Probability? Mathematical model for thigs that occur radomly. Radom ot haphazard, do t kow what will happe o ay oe experimet, but has a log ru order. The cocept of probability
More informationEE 6885 Statistical Pattern Recognition
EE 6885 Statistical Patter Recogitio Fall 5 Prof. Shih-Fu Chag http://www.ee.columbia.edu/~sfchag Lecture 6 (9/8/5 EE6887-Chag 6- Readig EM for Missig Features Textboo, DHS 3.9 Bayesia Parameter Estimatio
More informationIntroductory statistics
CM9S: Machie Learig for Bioiformatics Lecture - 03/3/06 Itroductory statistics Lecturer: Sriram Sakararama Scribe: Sriram Sakararama We will provide a overview of statistical iferece focussig o the key
More informationStatisticians use the word population to refer the total number of (potential) observations under consideration
6 Samplig Distributios Statisticias use the word populatio to refer the total umber of (potetial) observatios uder cosideratio The populatio is just the set of all possible outcomes i our sample space
More informationLecture 7: Properties of Random Samples
Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ
More informationOverview of Estimation
Topic Iferece is the problem of turig data ito kowledge, where kowledge ofte is expressed i terms of etities that are ot preset i the data per se but are preset i models that oe uses to iterpret the data.
More informationStatistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.
Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized
More information1 Review of Probability & Statistics
1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5
More informationMixtures of Gaussians and the EM Algorithm
Mixtures of Gaussias ad the EM Algorithm CSE 6363 Machie Learig Vassilis Athitsos Computer Sciece ad Egieerig Departmet Uiversity of Texas at Arligto 1 Gaussias A popular way to estimate probability desity
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationCS284A: Representations and Algorithms in Molecular Biology
CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by
More informationA PROBABILITY PRIMER
CARLETON COLLEGE A ROBABILITY RIMER SCOTT BIERMAN (Do ot quote without permissio) A robability rimer INTRODUCTION The field of probability ad statistics provides a orgaizig framework for systematically
More informationHypothesis Testing. Evaluation of Performance of Learned h. Issues. Trade-off Between Bias and Variance
Hypothesis Testig Empirically evaluatig accuracy of hypotheses: importat activity i ML. Three questios: Give observed accuracy over a sample set, how well does this estimate apply over additioal samples?
More informationJanuary 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS
Jauary 25, 207 INTRODUCTION TO MATHEMATICAL STATISTICS Abstract. A basic itroductio to statistics assumig kowledge of probability theory.. Probability I a typical udergraduate problem i probability, we
More informationECE 901 Lecture 12: Complexity Regularization and the Squared Loss
ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality
More informationStat410 Probability and Statistics II (F16)
Some Basic Cocepts of Statistical Iferece (Sec 5.) Suppose we have a rv X that has a pdf/pmf deoted by f(x; θ) or p(x; θ), where θ is called the parameter. I previous lectures, we focus o probability problems
More informationAsymptotics. Hypothesis Testing UMP. Asymptotic Tests and p-values
of the secod half Biostatistics 6 - Statistical Iferece Lecture 6 Fial Exam & Practice Problems for the Fial Hyu Mi Kag Apil 3rd, 3 Hyu Mi Kag Biostatistics 6 - Lecture 6 Apil 3rd, 3 / 3 Rao-Blackwell
More informationExam II Review. CEE 3710 November 15, /16/2017. EXAM II Friday, November 17, in class. Open book and open notes.
Exam II Review CEE 3710 November 15, 017 EXAM II Friday, November 17, i class. Ope book ad ope otes. Focus o material covered i Homeworks #5 #8, Note Packets #10 19 1 Exam II Topics **Will emphasize material
More informationProbability and statistics: basic terms
Probability ad statistics: basic terms M. Veeraraghava August 203 A radom variable is a rule that assigs a umerical value to each possible outcome of a experimet. Outcomes of a experimet form the sample
More informationLecture 1 Probability and Statistics
Wikipedia: Lecture 1 Probability ad Statistics Bejami Disraeli, British statesma ad literary figure (1804 1881): There are three kids of lies: lies, damed lies, ad statistics. popularized i US by Mark
More information