Machine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Recent Applications of Lasso
1 Machine Learning & Data Mining CS/CNS/EE 155 Lecture 4: Recent Applications of Lasso 1
2 Today: Two Recent Applications. Cancer Detection, and Personalization via Twitter (music, Biden, soccer, Labour). Applications of Lasso (and related methods): think about the data and modeling goals; some new learning problems. Slide material borrowed from Rob Tibshirani and Khalid El-Arini. Image Sources: http:// https://dl.dropboxusercontent.com/u/ /papers/badgepaper-kdd2013.pdf 2
3 Aside: Convexity. Not convex vs. convex: convexity makes it easy to find global optima! Strictly convex if the second derivative is always > 0. Image Source: http://en.wikipedia.org/wiki/Convex_function 3
4 Aside: Convexity. All local optima are global optima. Strictly convex: unique global optimum. Almost all objectives discussed in this course are (strictly) convex: SVMs, LR, Ridge, Lasso (except ANNs). 4
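The convexity claim for logistic regression can be checked numerically. The sketch below (illustrative, not lecture code; the finite-difference helper is a hypothetical utility) estimates the curvature of the logistic loss as a function of the margin and confirms it is non-negative everywhere sampled:

```python
import numpy as np

# Sketch: estimate curvature of the logistic loss by central finite
# differences; a convex function has non-negative second derivative.
def second_derivative(f, z, h=1e-3):
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / h ** 2

# Logistic loss as a function of the margin m = y * (w^T x - b).
def logistic_loss(m):
    return np.log1p(np.exp(-m))

margins = np.linspace(-10.0, 10.0, 201)
curvatures = np.array([second_derivative(logistic_loss, m) for m in margins])
is_convex_on_grid = bool(np.all(curvatures >= 0.0))
print(is_convex_on_grid)
```

The true second derivative here is sigma(m) * sigma(-m), which is strictly positive, so the logistic loss is strictly convex in the margin.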
5 Cancer Detection 5
6 Molecular assessment of surgical-resection margins of gastric cancer by mass-spectrometric imaging. Proceedings of the National Academy of Sciences (2014). Livia S. Eberlin, Robert Tibshirani, Jialing Zhang, Teri Longacre, Gerald Berry, David B. Bingham, Jeffrey Norton, Richard N. Zare, and George A. Poultsides. http://statweb.stanford.edu/~tibs/gp/canc.pdf. Gastric (stomach) cancer: tissue is either extracted or left in the patient; cell types are Cancer, Stromal, Epithelial; the goal is a normal margin. Workflow: 1. Surgeon removes tissue. 2. Pathologist examines tissue under a microscope. 3. If no clear margin, GOTO step 1. Image Source: http://statweb.stanford.edu/~tibs/gp/canc.pdf 6
7 Drawbacks. Expensive: requires a pathologist. Slow: examination can take up to an hour. Unreliable: in 20%-30% of cases the pathologist can't predict on the spot. (Workflow repeated: surgeon removes tissue; pathologist examines tissue under microscope; if no clear margin, GOTO step 1.) Image Source: http://statweb.stanford.edu/~tibs/gp/canc.pdf 7
8 Machine Learning to the Rescue! (Actually just statistics: Lasso originated in the statistics community, but we machine learners love it!) Basic Lasso: train a model to predict cancerous regions! y ∈ {C, E, S} (how do we predict 3 possible labels?). What is x? What is the loss function? argmin_{w,b} λ ||w||_1 + Σ_{i=1}^N L(y_i, w^T x_i − b) 8
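With squared loss plugged in for L, the objective above is ordinary Lasso regression. A minimal sketch on synthetic data (the dimensions and values are made up, not the actual mass-spec features) shows the sparsity that makes Lasso attractive here:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic regression problem: 200 "pixels", 50 features,
# but only the first 3 features actually carry signal.
rng = np.random.RandomState(0)
X = rng.randn(200, 50)
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.randn(200)

# sklearn's alpha plays the role of lambda in the slide's objective.
model = Lasso(alpha=0.1).fit(X, y)
n_nonzero = np.sum(model.coef_ != 0)
print(n_nonzero)  # far fewer than 50 nonzero weights
```

Because the L1 penalty zeroes out irrelevant weights, the fitted model keeps only a handful of the 50 features.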
9 Mass Spectrometry Imaging. DESI-MSI (Desorption Electrospray Ionization) effectively runs in real time (used to generate x). http://en.wikipedia.org/wiki/Desorption_electrospray_ionization Image Source: http://statweb.stanford.edu/~tibs/gp/canc.pdf 9
10 Each pixel is a data point: x via spectroscopy, y via cell-type label (Cancer, Epithelial, Stromal). For each pixel, x is the spectrum sampled at 11,000 m/z values. Image Source: http://statweb.stanford.edu/~tibs/gp/canc.pdf 10
11 x: spectrum sampled at 11,000 m/z values, so each pixel has 11K features. [Slide visualizes a few of these features across the tissue image.] Image Source: http://statweb.stanford.edu/~tibs/gp/canc.pdf 11
12 Multiclass Prediction. Multiclass y: S = {(x_i, y_i)}_{i=1}^N, x ∈ R^D, y ∈ {1, 2, ..., K}. Most common model: replicate weights, w = (w_1, w_2, ..., w_K) and b = (b_1, b_2, ..., b_K). Score all classes: f(x | w, b) = (w_1^T x − b_1, w_2^T x − b_2, ..., w_K^T x − b_K). Predict via largest score: argmax_k (w_k^T x − b_k). Loss function? 12
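A tiny sketch of this prediction rule (random numbers standing in for learned weights):

```python
import numpy as np

# One weight vector w_k and offset b_k per class; predict the argmax score.
rng = np.random.RandomState(1)
D, K = 4, 3                      # feature dimension, number of classes
W = rng.randn(K, D)              # row k is w_k
b = rng.randn(K)
x = rng.randn(D)

scores = W @ x - b               # f(x | w, b): one score per class
y_hat = int(np.argmax(scores))   # predict via largest score
print(y_hat)
```

Stacking the per-class weights as rows of a matrix turns "score all classes" into a single matrix-vector product.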
13 Multiclass Logistic Regression. Binary LR: P(y | x, w, b) = e^{y(w^T x − b)} / (e^{(w^T x − b)} + e^{−(w^T x − b)}), y ∈ {−1, +1}. Log-linear property: P(y | x, w, b) ∝ e^{y(w^T x − b)}; equivalently, keep one (w_k, b_k) per class with (w_{+1}, b_{+1}) = (−w_{−1}, −b_{−1}). Extension to multiclass: P(y = k | x, w, b) ∝ e^{w_k^T x − b_k}, keeping a (w_k, b_k) for each class. Multiclass LR: P(y = k | x, w, b) = e^{w_k^T x − b_k} / Σ_m e^{w_m^T x − b_m}. Referred to as multinomial log-likelihood by Tibshirani. http://statweb.stanford.edu/~tibs/gp/canc.pdf 13
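The multiclass probability above is just a softmax over the per-class scores. A minimal sketch (made-up values; the max-shift is a standard numerical-stability trick, not from the slides):

```python
import numpy as np

# P(y = k | x) = exp(w_k^T x - b_k) / sum_m exp(w_m^T x - b_m)
def multiclass_lr_prob(W, b, x):
    scores = W @ x - b
    scores = scores - scores.max()   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.RandomState(2)
W, b, x = rng.randn(3, 4), rng.randn(3), rng.randn(4)
p = multiclass_lr_prob(W, b, x)
print(p)   # three probabilities that sum to 1
```

Subtracting the max score leaves the probabilities unchanged (it cancels in the ratio) but prevents overflow in the exponentials.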
14 Multiclass Log Loss. argmin_{w,b} Σ_i −ln P(y_i | x_i, w, b), where P(y = k | x, w, b) = e^{w_k^T x − b_k} / Σ_m e^{w_m^T x − b_m}, x ∈ R^D, y ∈ {1, 2, ..., K}, w = (w_1, ..., w_K), b = (b_1, ..., b_K). Expanding: −ln P(y | x, w, b) = −w_y^T x + b_y + ln Σ_m e^{w_m^T x − b_m}. Gradient: ∂_{w_k} −ln P(y | x, w, b) = (−1 + P(y = k | x, w, b)) x if y = k, and P(y = k | x, w, b) x if y ≠ k. 14
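The gradient formula can be verified against finite differences. The sketch below (illustrative, not lecture code) drops b for brevity, since an offset can be folded into w via a constant feature:

```python
import numpy as np

# Gradient of the multiclass log loss: d/dw_k [-ln P(y|x)] = (P(k|x) - 1{y==k}) x
def probs(W, x):
    s = W @ x
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def neg_log_lik(W, x, y):
    return -np.log(probs(W, x)[y])

rng = np.random.RandomState(3)
K, D = 3, 5
W, x, y = rng.randn(K, D), rng.randn(D), 1

p = probs(W, x)
analytic = np.outer(p - np.eye(K)[y], x)   # (P(k|x) - 1{y=k}) x for every k

# Central finite differences over every entry of W.
numeric = np.zeros_like(W)
h = 1e-6
for k in range(K):
    for d in range(D):
        Wp, Wm = W.copy(), W.copy()
        Wp[k, d] += h
        Wm[k, d] -= h
        numeric[k, d] = (neg_log_lik(Wp, x, y) - neg_log_lik(Wm, x, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))  # agreement to high precision
```

The two gradients agree, confirming the sign pattern on the slide: the gradient pushes w_y's score up and every other class's score down, in proportion to the predicted probabilities.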
15 Multiclass Log Loss (intuition). Suppose x = 1 and ignore b, so the model score for class k is just w_k. Vary one weight w_k while holding the others at 1, and plot the log loss: it decreases in w_k when y = k and increases when y ≠ k. Recall −ln P(y | x, w, b) = −w_y^T x + b_y + ln Σ_m e^{w_m^T x − b_m}, with gradient ∂_{w_k} −ln P = (−1 + P(y = k | x, w, b)) x if y = k, and P(y = k | x, w, b) x if y ≠ k. 15
16 Lasso Multiclass Logistic Regression: argmin_{w,b} λ ||w||_1 + Σ_i −ln P(y_i | x_i, w, b), x ∈ R^D, y ∈ {1, 2, ..., K}, where ||w||_1 = Σ_k ||w_k||_1 = Σ_k Σ_d |w_kd|, w = (w_1, ..., w_K), b = (b_1, ..., b_K). A probabilistic model with sparse weights. 16
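This model is available off the shelf: scikit-learn's multinomial logistic regression supports an L1 penalty with the 'saga' solver. A sketch on synthetic stand-in data (not the actual {Cancer, Epithelial, Stromal} pixels; sizes and constants are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem where only 3 of 40 features matter.
rng = np.random.RandomState(0)
X = rng.randn(300, 40)
w_true = np.zeros((3, 40))
w_true[0, 0], w_true[1, 1], w_true[2, 2] = 3.0, 3.0, 3.0
y = np.argmax(X @ w_true.T + 0.5 * rng.randn(300, 3), axis=1)

# penalty='l1' + solver='saga' gives the Lasso-penalized multinomial model;
# small C means strong regularization (C is the inverse of lambda).
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
clf.fit(X, y)
sparsity = np.mean(clf.coef_ == 0)
print(sparsity)   # a large fraction of weights are exactly zero
```

The exact zeros are what make manual inspection of the learned weights feasible, as the later slide notes.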
17 Back to the Problem. Image the tissue samples: each pixel is an x with 11K features via mass spec (the spectrum sampled at 11,000 m/z values), computable in real time, one prediction per pixel. y comes via lab results, with roughly a 2-week turnaround. [Slide shows a visualization of all pixels for one feature.] 17
18 Learn a Predictive Model. Training set: 28 tissue samples from 14 patients; cross-validation to select λ. Test set: 21 tissue samples from 9 patients. Test performance: predictions made with a 0.2 margin in probability, compared against pathology. [Agreement tables: predicted Cancer / Epithelium / Stroma / Don't know vs. pathology, and Cancer vs. Normal, with per-class and overall agreement percentages; the exact counts are not recoverable from this transcription.] 18
19 Lasso yields sparse weights! (Manual inspection feasible!) Many features are correlated; Lasso tends to focus on one of them. http://cshprotocols.cshlp.org/content/2008/5/pdb.prot 19
20 Extension: Local Linearity. P(y | x, w, b) = e^{w_y^T x − b_y} / Σ_m e^{w_m^T x − b_m} assumes the probability shifts along a straight line in x, which is often not true. Approach: cluster based on x, then train a customized model for each cluster. Error rates (per patient, then overall): standard training 0.29%, 4.56%, 6.78%, 0.00%, 13.76%, 2.77%, overall 3.58%; customized training 0.71%, 1.89%, 0.82%, 0.40%, 9.43%, 0.92%, overall 1.89%. http://statweb.stanford.edu/~tibs/gp/canc.pdf 20
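A minimal sketch of this customized-training idea (the clustering method and data here are assumptions for illustration, not the paper's setup): cluster points by x, then fit a separate logistic regression per cluster, so each local model only needs to be linear within its own region.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Two well-separated groups of points with different local decision rules.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(150, 2) + c for c in ([0, 0], [6, 6])])
y = np.hstack([(X[:150, 0] > 0), (X[150:, 1] < 6)]).astype(int)

# Cluster on x alone, then train one model per cluster.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
models = {c: LogisticRegression().fit(X[clusters == c], y[clusters == c])
          for c in np.unique(clusters)}

# Predict each point with its own cluster's model.
pred = np.array([models[c].predict(x.reshape(1, -1))[0]
                 for c, x in zip(clusters, X)])
print(np.mean(pred == y))
```

A single global linear model cannot represent both local boundaries at once, while the per-cluster models handle each region easily.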
21 Recap: Cancer Detection. Seems awesome! What's the catch? Small sample size: tested on only 9 patients. Machine learning is only part of the solution: infrastructure investment is needed, and the scientific legitimacy must be analyzed. Social/political/legal questions: if there is a misprediction, who is at fault? 21
22 Personalization via Twitter (music, Biden, soccer, Labour) 22
23 Representing Documents Through Their Readers. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (2013). Khalid El-Arini, Min Xu, Emily Fox, Carlos Guestrin. https://dl.dropboxusercontent.com/u/ /papers/badgepaper-kdd2013.pdf. Overloaded by news: 1 million news articles and blog posts generated every hour.* 23
24 News Recommendation Engine: user, corpus. Vector representation: bag of words, LDA topics, etc. 24
25 Challenge. The most common representations don't naturally line up with user interests. Fine-grained representations (bag of words) are too specific. High-level topics (e.g., from a topic model) are too fuzzy and/or vague, and can be inconsistent over time. 25
26 Goal: improve recommendation performance through a more natural document representation. 26
27 An Opportunity: News is Now Social. In 2012, the Guardian announced that more readers visit its site via Facebook than via Google search. 27
28 badges 28
29 Approach: learn a document representation based on how readers publicly describe themselves. 29
30 30
31 Using many tweets, can we learn that someone who identifies with "music" via profile badges reads articles with these words? 31
32 Given: a training set of tweeted news articles from a specific period of time. 1. Learn a badge dictionary from the training set (e.g., "music"; 3 million articles; a words × badges matrix). 2. Use the badge dictionary to encode new articles. 32
33 Advantages: Interpretable. Clear labels. Correspond to user interests. Higher-level than words. 33
35 Advantages: Interpretable. Clear labels. Correspond to user interests. Higher-level than words. Semantically consistent over time (e.g., the "politics" badge). 35
36 Given: a training set of tweeted news articles from a specific period of time. 1. Learn a badge dictionary from the training set (e.g., "music"; 3 million articles; a words × badges matrix). 2. Use the badge dictionary to encode new articles. 36
37 Training Data: Dictionary Learning. S = {(z_i, y_i)}_{i=1}^N, where z_i identifies the badges in the Twitter profile of the tweeter and y_i is the bag-of-words representation of the document. Example: y contains words like album, Fleetwood Mac, love, Nicks; z contains badges gig, music, cycling, linux. 37
38 Training Objective: Dictionary Learning. argmin_{B,W} λ_B ||B||_1 + λ_W ||W||_1 + Σ_{i=1}^N ||y_i − B W_i||^2, where S = {(z_i, y_i)}_{i=1}^N, z_i identifies badges in the Twitter profile of the tweeter, and y_i is the bag-of-words representation of the document. B is the dictionary (words × badges) and W is the encoding (badges × articles). 38
argmin_{B,W} λ_B ||B||_1 + λ_W ||W||_1 + Σ_{i=1}^N ||y_i − B W_i||^2, with dictionary B (words × badges) and encoding W (badges × articles). Not convex (because of the BW term)! But it is convex if we optimize only B or only W (not both), so use alternating optimization between B and W. How to initialize? Use S = {(z_i, y_i)}_{i=1}^N: initialize the encoding W_i from the badge indicators z_i (e.g., gig, music, cycling, linux). 39
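The alternating scheme can be sketched in a few lines. This is a minimal illustration under assumptions (synthetic sizes, the graph term omitted, random initialization instead of the z-based one, and sklearn's Lasso standing in as the per-subproblem solver):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Each subproblem is convex: solve for B with W fixed, then for W with B
# fixed, and repeat.  Y collects the article word vectors y_i as columns,
# so the data term is ||Y - B W||^2 summed over articles.
rng = np.random.RandomState(0)
n_words, n_badges, n_articles = 30, 5, 60
B_true = rng.exponential(1.0, (n_words, n_badges)) * (rng.rand(n_words, n_badges) < 0.3)
W_true = rng.exponential(1.0, (n_badges, n_articles)) * (rng.rand(n_badges, n_articles) < 0.5)
Y = B_true @ W_true + 0.01 * rng.randn(n_words, n_articles)

B = rng.rand(n_words, n_badges)
W = rng.rand(n_badges, n_articles)
recon_before = np.sum((Y - B @ W) ** 2)

for _ in range(10):
    # min_B ||Y - B W||^2 + lam ||B||_1: one Lasso per word (row of Y)
    B = Lasso(alpha=0.01, fit_intercept=False, max_iter=5000).fit(W.T, Y.T).coef_
    # min_W ||Y - B W||^2 + lam ||W||_1: one Lasso per article (column of Y)
    W = Lasso(alpha=0.01, fit_intercept=False, max_iter=5000).fit(B, Y).coef_.T

recon_after = np.sum((Y - B @ W) ** 2)
print(recon_after < recon_before)
```

Each half-step is an ordinary Lasso fit, which is why the non-convex joint problem can still be attacked with standard convex solvers.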
argmin_{B,W} λ_B ||B||_1 + λ_W ||W||_1 + Σ_{i=1}^N ||y_i − B W_i||^2. Suppose badge s often co-occurs with badge t, so the dictionary columns B_s and B_t are correlated. From the perspective of W, the columns of B are features, and Lasso tends to focus on just one of a group of correlated features. Yet many articles might be about "gig", "festival", and "music" simultaneously. 40
Suppose badge s often co-occurs with badge t, so B_s and B_t are correlated; from the perspective of W, the columns of B are features, and Lasso tends to focus on one correlated feature. Fix: Graph-Guided Fused Lasso: argmin_{B,W} λ_B ||B||_1 + λ_W ||W||_1 + λ_G Σ_{(s,t)∈E(G)} ω_st Σ_{i=1}^N |W_is − W_it| + Σ_{i=1}^N ||y_i − B W_i||^2, where the graph G of related badges is built from badge co-occurrence rates on Twitter profiles. 41
Encoding New Articles. The badge dictionary B is already learned. Given a new document j with word vector y_j, learn the badge encoding W_j: argmin_{W_j} λ_W ||W_j||_1 + λ_G Σ_{(s,t)∈E(G)} ω_st |W_js − W_jt| + ||y_j − B W_j||^2. 42
43 Recap: Badge Dictionary Learning. 1. Learn a badge dictionary (words × badges) from the training set (e.g., "music"). 2. Use the badge dictionary to encode new articles. 43
44 Examining B: top words learned for badges such as music, Biden, soccer, and Labour (September 2012). 44
45 Badges Over Time: word lists for the music and Biden badges compared across time periods (September 2012 onward). 45
46 A Spectrum of Pundits. TCOT stands for "top conservatives on Twitter". Limit the badges to "progressive" and "TCOT": can we predict the political alignments of likely readers? Take all articles by each columnist and look at the average encoding score, progressive vs. TCOT. Columnists placed along the spectrum toward more conservative: Dowd, Krugman, Parker, Rich, Kristof, Kristol, Klein, Goldberg, Brooks, Friedman, Zakaria, Ignatius, Krauthammer, Coulter. 46
47 User Study. Which representation best captures user preferences over time? Study on Amazon Mechanical Turk with 112 users. 1. Show users 20 random articles from the Guardian, from time period 1, and obtain ratings. 2. Pick a random representation: bag of words, high-level topics, or badges. 3. Represent user preferences as the mean of liked articles. 4. Go to the next time period, recommend according to preferences, and GOTO step 2. 47
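Step 3's preference model and the recommendation step can be sketched as follows (a minimal illustration with made-up vectors; cosine similarity is an assumption here, since the slide does not name the similarity measure):

```python
import numpy as np

# Represent the user as the mean of liked-article vectors, then recommend
# by similarity in whatever representation is in use (bag of words,
# high-level topics, or badges).
rng = np.random.RandomState(0)
articles = rng.rand(50, 6)              # 50 articles, 6-dim encoding
liked = articles[[3, 7, 9]]             # articles the user rated up
profile = liked.mean(axis=0)            # user preference vector

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sims = np.array([cosine(profile, a) for a in articles])
recommended = np.argsort(-sims)[:5]     # top-5 for the next time period
print(recommended)
```

The same loop works for every representation being compared; only the article vectors change, which is what makes the study a clean comparison of representations.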
48 User Study results: number of liked articles (higher is better) for bag of words (tf-idf), high-level topics (LDA), and badges. [Bar chart; badges performed best.] 48
49 Recap: Personalization via Twitter. Sparse dictionary learning: learn a new representation of articles and encode articles using the dictionary. Better than bag of words; better than high-level topics. Based on social data: badges on the Twitter profile of whoever is tweeting, capturing semantics not directly evident from the text alone. 49
50 Next Week: Sequence Prediction, Hidden Markov Models, Conditional Random Fields. Homework 1 due Tuesday via Moodle. 50
More informationCSCE 478/878 Lecture 6: Bayesian Learning
Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell
More informationMixtures of Gaussians continued
Mixtures of Gaussians continued Machine Learning CSE446 Carlos Guestrin University of Washington May 17, 2013 1 One) bad case for k-means n Clusters may overlap n Some clusters may be wider than others
More informationLatent Dirichlet Allocation
Outlines Advanced Artificial Intelligence October 1, 2009 Outlines Part I: Theoretical Background Part II: Application and Results 1 Motive Previous Research Exchangeability 2 Notation and Terminology
More informationConvex optimization COMS 4771
Convex optimization COMS 4771 1. Recap: learning via optimization Soft-margin SVMs Soft-margin SVM optimization problem defined by training data: w R d λ 2 w 2 2 + 1 n n [ ] 1 y ix T i w. + 1 / 15 Soft-margin
More informationReplicated Softmax: an Undirected Topic Model. Stephen Turner
Replicated Softmax: an Undirected Topic Model Stephen Turner 1. Introduction 2. Replicated Softmax: A Generative Model of Word Counts 3. Evaluating Replicated Softmax as a Generative Model 4. Experimental
More informationWhy Should I Trust You? Speaker: Philip-W. Grassal Date: 05/03/18
Why Should I Trust You? Speaker: Philip-W. Grassal Date: 05/03/18 Why should I trust you? Why Should I Trust You? - Philip Grassal 2 Based on... Title: Why Should I Trust You? Explaining the Predictions
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationInformation Extraction from Text
Information Extraction from Text Jing Jiang Chapter 2 from Mining Text Data (2012) Presented by Andrew Landgraf, September 13, 2013 1 What is Information Extraction? Goal is to discover structured information
More informationQuan&fying Uncertainty. Sai Ravela Massachuse7s Ins&tute of Technology
Quan&fying Uncertainty Sai Ravela Massachuse7s Ins&tute of Technology 1 the many sources of uncertainty! 2 Two days ago 3 Quan&fying Indefinite Delay 4 Finally 5 Quan&fying Indefinite Delay P(X=delay M=
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationCS 570: Machine Learning Seminar. Fall 2016
CS 570: Machine Learning Seminar Fall 2016 Class Information Class web page: http://web.cecs.pdx.edu/~mm/mlseminar2016-2017/fall2016/ Class mailing list: cs570@cs.pdx.edu My office hours: T,Th, 2-3pm or
More informationDetec%ng and Analyzing Urban Regions with High Impact of Weather Change on Transport
Detec%ng and Analyzing Urban Regions with High Impact of Weather Change on Transport Ye Ding, Yanhua Li, Ke Deng, Haoyu Tan, Mingxuan Yuan, Lionel M. Ni Presenta;on by Karan Somaiah Napanda, Suchithra
More informationCS 540: Machine Learning Lecture 1: Introduction
CS 540: Machine Learning Lecture 1: Introduction AD January 2008 AD () January 2008 1 / 41 Acknowledgments Thanks to Nando de Freitas Kevin Murphy AD () January 2008 2 / 41 Administrivia & Announcement
More informationMachine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.
More informationLogistic Regression Introduction to Machine Learning. Matt Gormley Lecture 9 Sep. 26, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Logistic Regression Matt Gormley Lecture 9 Sep. 26, 2018 1 Reminders Homework 3:
More informationCOMS 4771 Introduction to Machine Learning. James McInerney Adapted from slides by Nakul Verma
COMS 4771 Introduction to Machine Learning James McInerney Adapted from slides by Nakul Verma Announcements HW1: Please submit as a group Watch out for zero variance features (Q5) HW2 will be released
More informationLecture 12. Neural Networks Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 12 Neural Networks 10.12.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationMachine Learning, Fall 2012 Homework 2
0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0
More information