
1 Introduction Le Song Machine Learning I CSE 6740, Fall 2013

2 What is Machine Learning (ML)? Study of algorithms that improve their performance at some task with experience.

3 Common to industrial-scale problems: 13 million Wikipedia pages, 800 million users, 6 billion photos, 24 hours of video uploaded per minute, > 1 trillion webpages.

4 Organizing Images: image databases. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

5 Visualize Image Relations: each image has thousands or millions of pixels. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

6 Organizing documents: reading, digesting, and categorizing a vast text database is too much for humans. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

7 Weather Prediction: predict numeric values, e.g. temperature 40 °F, wind NE at 14 km/h, humidity 83%. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

8 Face Detection. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

9 Understanding brain activity. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

10 Product Recommendation. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

11 Handwritten digit recognition / text annotation. Inter-character dependency; inter-word dependency: "Aoccdrnig to a sudty at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe." What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

12 Similar problem: speech recognition. Machine learning is the preferred method for speech recognition; models such as Hidden Markov Models map audio signals to text.

13 Similar problem: bioinformatics. [The slide shows two near-identical walls of DNA sequence (a, c, g, t); in one of them the phrase "where is the gene" is hidden inside the sequence.] Where is the gene?

14 Spam Filtering. What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

15 Similar problem: webpage classification. Company homepage vs. university homepage.

16 Robot Control: now cars can find their own way! What are the desired outcomes? What are the inputs (data)? What are the learning paradigms?

17 Nonlinear classifiers: linear SVM decision boundaries vs. nonlinear decision boundaries.

18 Nonconventional clusters: these need more advanced methods, such as kernel methods or spectral clustering.

19 Syllabus. The course covers the most commonly used machine learning algorithms, in enough detail to understand their mechanisms. Organization: unsupervised learning (data exploration): clustering, dimensionality reduction, density estimation, novelty detection. Supervised learning (predictive models): classification, regression. Complex models (dealing with nonlinearity, combining models, etc.): kernel methods, graphical models, boosting.

20 Prerequisites. Probability: distributions, densities, marginalization, conditioning. Basic statistics: moments, classification, regression, maximum likelihood estimation. Algorithms: dynamic programming, basic data structures, complexity. Programming: mostly your choice of language, but Matlab will be very useful. The class will be fast paced; the ability to deal with abstract mathematical concepts is expected.

21 Textbooks: Pattern Recognition and Machine Learning, Chris Bishop; The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, Jerome Friedman; Machine Learning, Tom Mitchell.

22 Grading. 6 assignments (60%), approximately 1 assignment every 4 lectures; start early. Midterm exam (20%). Final exam (20%). A project for advanced students can be used to replace the exams, based on student experience and lecturer interests.

23 Homeworks. Zero credit after each deadline, but all homeworks must be handed in, even for zero credit. Collaboration: you may discuss the questions, but each student writes their own answers; write on your homework anyone with whom you collaborate. Each student must write their own code for the programming part.

24 Staff. Instructor: Le Song, Klaus 1340. TAs: Joonseok, Klaus 1305; Nan Du, Klaus 1305. Guest Lecturer: TBD. Administrative Assistant: Mimi Haley, Klaus 1321. Mailing list and more information: (links not preserved in the transcription).

25 Today: probabilities, independence, conditional independence.

26 Random Variables (RV). Data may contain many different attributes: age, grade, color, location, coordinate, time. Upper case for random variables (e.g. X, Y), lower case for values (e.g. x, y); P(X) for a distribution, p(x) for a density. Val(X) = the set of possible values of random variable X. For discrete (categorical) variables: $\sum_{i=1}^{|Val(X)|} P(X = x_i) = 1$. For continuous variables: $\int_{Val(X)} p(X = x)\,dx = 1$, with $P(x) \ge 0$. Shorthand: P(x) for P(X = x).
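To make the normalization conditions concrete, here is a minimal Python sketch (my own illustration, not from the slides; the three-valued distribution and the density p(x) = 2x are made up) checking that a discrete distribution sums to 1 and a continuous density integrates to 1 over Val(X):

```python
import numpy as np

# Discrete RV: P(X = x_i) over Val(X) = {1, 2, 3} (made-up probabilities)
P = {1: 0.2, 2: 0.5, 3: 0.3}
assert abs(sum(P.values()) - 1.0) < 1e-12    # sum_i P(X = x_i) = 1

# Continuous RV: density p(x) = 2x on Val(X) = [0, 1]
xs = np.linspace(0.0, 1.0, 100001)
px = 2.0 * xs                                # p(x) >= 0 on Val(X)
dx = xs[1] - xs[0]
integral = np.sum(0.5 * (px[:-1] + px[1:]) * dx)  # trapezoid rule for the integral
print(integral)                              # ~ 1.0
```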

27 Interpretations of probability. Frequentist: P(x) is the frequency of x in the limit. There are many arguments against this interpretation: what is the frequency of the event "it will rain tomorrow"? Subjective interpretation: P(x) is my degree of belief that x will happen. What does degree of belief mean? If P(x) = 0.8, then I am willing to bet on x at the corresponding odds (e.g., accepting up to 4-to-1 odds). For this class, we don't care which interpretation you adopt.

28 Conditional probability. After we have seen x, how likely do we now think y is? P(y|x) means P(Y = y | X = x). A conditional distribution is a family of distributions: for each value X = x, P(Y|x) is a distribution over Y.

29 Two of the most important rules: I. The chain rule. $P(y, x) = P(y|x)\,P(x)$. More generally: $P(x_1, x_2, \ldots, x_k) = P(x_1)\,P(x_2|x_1)\cdots P(x_k | x_{k-1}, \ldots, x_2, x_1)$.
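A quick numeric sanity check of the chain rule (my own sketch, using an arbitrary random joint table): factor P(x, y) into P(x) P(y|x) and confirm the product reconstructs the joint.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 3))
joint /= joint.sum()                   # a random joint P(x, y)

P_x = joint.sum(axis=1)                # P(x) by marginalizing out y
P_y_given_x = joint / P_x[:, None]     # P(y|x) = P(x, y) / P(x)

# Chain rule: P(x, y) = P(y|x) P(x)
assert np.allclose(P_y_given_x * P_x[:, None], joint)
```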

30 Two of the most important rules: II. Bayes' rule. $P(y|x) = \frac{P(x|y)\,P(y)}{P(x)}$, where the posterior P(y|x) combines the likelihood P(x|y) with the prior P(y), and the normalization constant is $P(x) = \sum_{y \in Val(Y)} P(x, y)$. More generally, with an additional variable z: $P(y|x, z) = \frac{P(x|y, z)\,P(y|z)}{P(x|z)}$.
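As an illustration of Bayes' rule (a sketch with invented numbers, not an example from the lecture): infer P(disease | positive test) from a prior and a likelihood, computing the normalization constant P(x) by marginalizing over y.

```python
# Invented numbers, for illustration only.
prior = {"disease": 0.01, "healthy": 0.99}        # P(y)
likelihood = {"disease": 0.95, "healthy": 0.05}   # P(test=+ | y)

# Normalization constant: P(x) = sum_y P(x|y) P(y)
P_pos = sum(likelihood[y] * prior[y] for y in prior)

# Posterior: P(y|x) = P(x|y) P(y) / P(x)
posterior = {y: likelihood[y] * prior[y] / P_pos for y in prior}
print(posterior)   # P(disease | +) ~ 0.161, despite the seemingly accurate test
```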

31 Independence. X and Y are independent (written X ⊥ Y) if P(Y|X) = P(Y). Proposition: X and Y are independent if and only if P(X, Y) = P(X)P(Y). (By the chain rule, P(X, Y) = P(Y|X)P(X), which equals P(Y)P(X) exactly when P(Y|X) = P(Y).)

32 Conditional independence. Independence is rarely true; conditional independence is more prevalent. X and Y are conditionally independent given Z (written X ⊥ Y | Z) if P(Y|X, Z) = P(Y|Z). Equivalently, X ⊥ Y | Z if and only if P(X, Y | Z) = P(X|Z)P(Y|Z).
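A small sketch (my own example, with made-up conditionals) testing X ⊥ Y | Z on a 2×2×2 joint: build P(X, Y | z) for each z and compare it against the product P(X|z)P(Y|z).

```python
import numpy as np

# A joint P(x, y, z) constructed so that X ⊥ Y | Z holds by design:
# each z-slice factorizes as P(x|z) P(y|z) P(z).
P_z = np.array([0.4, 0.6])
P_x_given_z = np.array([[0.3, 0.7], [0.8, 0.2]])   # rows indexed by z
P_y_given_z = np.array([[0.5, 0.5], [0.1, 0.9]])
joint = np.einsum('z,zx,zy->xyz', P_z, P_x_given_z, P_y_given_z)

for z in range(2):
    P_xy_z = joint[:, :, z] / joint[:, :, z].sum()      # P(X, Y | z)
    P_x_z = P_xy_z.sum(axis=1)                           # P(X | z)
    P_y_z = P_xy_z.sum(axis=0)                           # P(Y | z)
    assert np.allclose(P_xy_z, np.outer(P_x_z, P_y_z))   # X ⊥ Y | Z
```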

33 Joint distribution, marginalization. Two random variables: Grade (G) with values {A, B} and Intelligence (I) with values {VH, H}; P(G, I) is a 2×2 table over these values (the numeric entries are omitted here). For n binary variables, the table (multiway array) gets really big: $P(X_1, X_2, \ldots, X_n)$ has $2^n$ entries! Marginalization: compute the marginal over a single variable, e.g. P(G = B) = P(G = B, I = VH) + P(G = B, I = H).

34 Marginalization, the general case. Compute the marginal distribution $P(X_i)$ from $P(X_1, X_2, \ldots, X_i, X_{i+1}, \ldots, X_n)$: $P(X_1, X_2, \ldots, X_i) = \sum_{x_{i+1}, \ldots, x_n} P(X_1, X_2, \ldots, X_i, x_{i+1}, \ldots, x_n)$ and $P(X_i) = \sum_{x_1, \ldots, x_{i-1}} P(x_1, \ldots, x_{i-1}, X_i)$. For binary variables, this means summing over $2^{n-1}$ terms!
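To make this concrete, a brief sketch (mine, using an arbitrary random joint) that marginalizes an n = 4 binary joint down to P(X_1), summing over the other 2^(n-1) = 8 configurations for each value of X_1:

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
joint = rng.random((2,) * n)
joint /= joint.sum()             # P(X1, X2, X3, X4): 2^n = 16 entries

# P(X1) = sum over x2, x3, x4 of P(X1, x2, x3, x4): 2^(n-1) = 8 terms per value
P_x1 = joint.sum(axis=(1, 2, 3))
print(P_x1, P_x1.sum())          # the marginal, which still sums to 1
```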

35 Example problem. Estimate the probability θ of landing heads using a biased coin, given a sequence of N independently and identically distributed (iid) flips. E.g., $D = \{x_1, x_2, \ldots, x_N\} = \{1, 0, 1, \ldots, 0\}$, $x_i \in \{0, 1\}$. Model: $P(x|\theta) = \theta^x (1-\theta)^{1-x}$, i.e. $1-\theta$ for x = 0 and $\theta$ for x = 1. Likelihood of a single observation $x_i$: $P(x_i|\theta) = \theta^{x_i} (1-\theta)^{1-x_i}$.
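The single-observation likelihood translates directly into code; a minimal sketch (with a made-up data sequence) evaluating P(x_i | θ) and the iid likelihood of a whole dataset:

```python
import numpy as np

def bernoulli_likelihood(x, theta):
    """P(x | theta) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta**x * (1.0 - theta)**(1 - x)

D = np.array([1, 0, 1, 1, 0, 1])             # made-up iid coin flips
theta = 0.6
L = np.prod(bernoulli_likelihood(D, theta))  # P(D | theta) = prod_i P(x_i | theta)
print(L)
```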

36 Frequentist Parameter Estimation. Frequentists think of a parameter as a fixed, unknown constant, not a random variable; hence they use different objective estimators instead of Bayes' rule. These estimators have different properties, such as being unbiased, minimum variance, etc. A very popular estimator is the maximum likelihood estimator (MLE), which is simple and has good statistical properties: $\hat{\theta} = \arg\max_\theta P(D|\theta) = \arg\max_\theta \prod_{i=1}^{N} P(x_i|\theta)$.

37 MLE for the Biased Coin. Objective function, the log likelihood: $\ell(\theta; D) = \log P(D|\theta) = \log \theta^{n_h} (1-\theta)^{n_t} = n_h \log\theta + (N - n_h)\log(1-\theta)$. We maximize this w.r.t. θ by setting the derivative to zero: $\frac{\partial \ell}{\partial \theta} = \frac{n_h}{\theta} - \frac{N - n_h}{1-\theta} = 0$, giving $\hat{\theta}_{MLE} = \frac{n_h}{N}$, or equivalently $\hat{\theta}_{MLE} = \frac{1}{N}\sum_i x_i$.
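A short sketch (my own, on simulated flips) confirming the closed form: numerically maximizing the log likelihood over a grid recovers θ_MLE = n_h / N.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.random(1000) < 0.3                 # simulate flips with true theta = 0.3
n_h, N = D.sum(), D.size

theta_mle = n_h / N                        # closed-form MLE

# Grid search over l(theta) = n_h log(theta) + (N - n_h) log(1 - theta)
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
loglik = n_h * np.log(grid) + (N - n_h) * np.log(1 - grid)
print(theta_mle, grid[np.argmax(loglik)])  # the two estimates agree
```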

38 Bayesian Parameter Estimation. Bayesians treat the unknown parameter as a random variable, whose distribution can be inferred using Bayes' rule: $P(\theta|D) = \frac{P(D|\theta)\,P(\theta)}{P(D)} = \frac{P(D|\theta)\,P(\theta)}{\int_\theta P(D|\theta)\,P(\theta)\,d\theta}$. The crucial equation can be written in words: posterior = likelihood × prior / marginal likelihood. For iid data, the likelihood is $P(D|\theta) = \prod_{i=1}^{N} P(x_i|\theta) = \prod_{i=1}^{N} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum_i x_i} (1-\theta)^{\sum_i (1-x_i)} = \theta^{\#head} (1-\theta)^{\#tail}$. The prior P(θ) encodes our prior knowledge of the domain; different priors P(θ) end up with different estimates P(θ|D)!

39 Bayesian estimation for the biased coin. Prior over θ: the Beta distribution, $P(\theta; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha-1} (1-\theta)^{\beta-1}$, where for integer x, $\Gamma(x+1) = x\,\Gamma(x) = x!$. Posterior distribution of θ: $P(\theta | x_1, \ldots, x_N) = \frac{P(x_1, \ldots, x_N | \theta)\,P(\theta)}{P(x_1, \ldots, x_N)} \propto \theta^{n_h} (1-\theta)^{n_t} \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}$. α and β are hyperparameters and correspond to the numbers of virtual heads and tails (pseudo-counts).

40 Bayesian Estimation for Bernoulli. Posterior distribution: $P(\theta | x_1, \ldots, x_N) \propto \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}$. Posterior mean estimate: $\hat{\theta}_{Bayes} = \int \theta\,P(\theta|D)\,d\theta = C \int \theta \cdot \theta^{n_h+\alpha-1} (1-\theta)^{n_t+\beta-1}\,d\theta = \frac{n_h + \alpha}{N + \alpha + \beta}$. Prior strength: A = α + β can be interpreted as the size of an imaginary dataset.
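Putting the pieces together, a hedged sketch (made-up flips, my own choice of hyperparameters) comparing the MLE with the Beta-Bernoulli posterior mean; with few observations, the pseudo-counts α, β pull the estimate toward the prior.

```python
import numpy as np

D = np.array([1, 1, 1, 0, 1])                     # made-up flips: 4 heads, 1 tail
n_h, N = D.sum(), D.size
alpha, beta = 2.0, 2.0                            # prior strength A = alpha + beta = 4

theta_mle = n_h / N                               # 0.8
theta_bayes = (n_h + alpha) / (N + alpha + beta)  # (4 + 2) / (5 + 4) ~ 0.667
print(theta_mle, theta_bayes)                     # the prior pulls the estimate toward 0.5
```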
