Applied Machine Learning for Design Optimization in Cosmology, Neuroscience, and Drug Discovery

Size: px

Start display at page:

Download "Applied Machine Learning for Design Optimization in Cosmology, Neuroscience, and Drug Discovery"

Shannon Harmon
6 years ago
Views:

1 Applied Machine Learning for Design Optimization in Cosmology, Neuroscience, and Drug Discovery Barnabas Poczos Machine Learning Department Carnegie Mellon University Machine Learning Technologies and their Applications for Scientific and Engineering Domains Workshop NASA Langley Research Center 1 August 16, 2016

2 Goal: Create a Scientific Assistant ML & Stat Scientist Automate discoveries find anomalies, interesting events scientific laws Experiment Collect Data Analysis 2 Recommend experiments to run Recommend what to collect Recommend how to analyze

3 Computer vision, Robotics EEG, fmri, MEG, Astronomy machine learning applications Drug Discovery AI in games Neuroscience 3 Turbulences ML in Agriculture Microarray 3

4 Machine Learning on Complex Objects 4

5 Traditional Machine Learning Observations Feature vectors Training data of feature vectors ML algorithm: classification, regression, clustering, etc 5

6 Finance Complex Data is Everywhere Neuroscience Diffusion Weighted Imaging Cosmology Images 6 6

7 Distributional Data Manchester United 07/08 Owen Hargreaves Rio Ferdinand Cristiano Ronaldo 7

sample Standard ML on sets/distributions machine learning Set

8 ML on Distributions healthy or sick? Medical tests: blood pressure, heart rate, temperature, blood sample Standard ML on sets/distributions machine learning Set Feature of feature vector vectors Classifier Healthy Sick What happens if we repeat the medical tests? 8

9 Distribution Regression / Classification Y 1 =1 Y 2 =0 Y 3 =1 Y m =0? Differences compared to standard methods on vectors P 1 P 2 P 3 P P The inputs are distributions, density functions (not m vectors) m+1 We don t know these distributions, only sample sets are available (error in variables model) 9

10 Distribution Classification Solution: Use RKHS based SVM! Calculate the Gram matrix Dual form of SVM: 10 Problems:

11 Distances / Divegences between Distributions RÉNYI DIVERGENCE ESTIMATION Using without density estimation Estimate divergence 11 11

12 12 The Estimator

13 ML on Distributions 13 Dealing with complex objects break into smaller parts, represent the input as a set of smaller parts treat the set elements as sample points from some unknown distribution do ML on these unknown distributions represented by sets

14 Sport Events Classification [Li and Fei Fei, 2007] badminton bocce croquet polo climbing rowing sailing snowboard 8 categories, 1040 images, each represented by 295 to dim points. Best published: 86.7% (Zhang et al, CVPR 2011) NPR: 87.1% 14 2 fold CV, 16 runs

15 50 highway images Detecting Anomalous Images 5 anomalies 15 2-dimensional sample set representation of images (128 dim SIFT ) 2 dim) Anomaly score: divergences between the distributions of these sample sets 15

16 16 Cosmology Applications

Find new scientific laws in physics Goal: Estimate dynamical mass of galaxy clusters. Importance: Galaxy clusters are being the largest gravitationally bound systems in the Universe.

17 Find new scientific laws in physics Goal: Estimate dynamical mass of galaxy clusters. Importance: Galaxy clusters are being the largest gravitationally bound systems in the Universe. Dynamical mass measurements are important to understand the behavior of dark matter and normal matter. Difficulty: We can only measure the velocity of galaxies not the mass of their cluster Physicists estimate dynamical cluster mass from single velocity dispersion. 17 Our method: Estimate the cluster mass from the whole distribution of velocities rather than just a simple velocity distribution.

18 Find new scientific laws in physics 18 Michelle Ntampaka et al, A Machine Learning Approach for Dynamical Mass Measurements of Galaxy Clusters, APJ 2015

Find interesting Galaxy Clusters Sloan Digital

galaxy clusters (10-50 galaxies in each) 7530

mostly star forming blue galaxies irregular

19 Find interesting Galaxy Clusters Sloan Digital Sky Survey (SDSS) continuum spectrum 505 galaxy clusters (10-50 galaxies in each) 7530 galaxies Blue galaxy What are the most anomalous galaxy clusters? The most anomalous galaxy cluster contains mostly star forming blue galaxies irregular galaxies 19 B. Póczos, L. Xiong & J. Schneider, UAI, Red galaxy Credits: ESA, NASA

20 Find the parameters of Universe Given a distribution of particles, our goal is to predict the parameters of the simulated universe 20

21 Active Learning & Design Optimization 21

hypothetical parameters, mathematical model simulated observations 22 Computation problem: How

22 NASA / ESA Recommend experiments to find the true parameters of the universe g( ) surrogate function parameter space NASA true parameters real universe noisy observations hypothesis test hypothetical parameters, mathematical model simulated observations 22 Computation problem: How to search parameter space Solution: Learn a surrogate function and make experiment decisions using it

Recommend experiments for drug discovery

Observations ($, time) Real Observations

23 Recommend experiments for drug discovery parameter space Parameters of Drugs: Compounds Quantities etc. Drug effects on the Lab mouse Expensive Observations ($, time) Real Observations Simulated Observations Expert predictions (Bias, Variance, Fidelity, Cost) Observations: Blood samples Camera images EEG, etc. 23 g( ) surrogate function to optimize

24 predicted number of galaxies Learning Relationships from Simulations Goal: predict the number of galaxies in a halo from a half dozen dark matter halo parameters 24 true number of galaxies [Xiaoying Xu, 2012] (#particles in a halo, velocity dispersion, max circular velocity, half mass radius, ) data: Millenium simulation 395,832 halos method: support vector regression

25 The Galaxy Zoo challenge Crowdsourcing project Users are asked to describe the morphology of galaxies based on images. They are asked questions such as How rounded is the galaxy and Does it have a central bulge 37 different categories in a decision tree Training set: JPG images of galaxies. Test set: JPG images of galaxies Image resolution: 424x424 color JPEG images 25

26 26 Willett et al

27 27

28 The Large Synoptic Survey Telescope Big data questions 15 Terabytes of data every night 28

29 29 Other Examples in Physics

30 30 ML to Help Understanding Turbulences

Turbulence Data Classification Simulated fluid flow through time (JHU Turbulence Research Group, Alex Szalay) Goal: find interesting vortices!

31 Turbulence Data Classification Simulated fluid flow through time (JHU Turbulence Research Group, Alex Szalay) Goal: find interesting vortices! events, patterns, phenomena 11 positive, 20 negative examples Results: Leave one out cross-validation : 97% Velocity distributions 31 Positive (vortex) Negative Negative 31

32 Find Interesting Phenomena in Turbulence Data Anomaly detection 32 Anomaly scores

33 Finding Vortices 33 Classification probabilities

34 34 Fusion power plants

35 35 Neuroscience

36 ML in Action: Neuroimaging MEG/ fmri mind reading contest MRI lie detector Decoding thoughts from brain scans Rob a bank 36 36

37 FuSSO = Functional Shrinkage and Selection Operator (Functional Lasso) 37 37

38 Function-to-Real regression 38 38

39 Many Functions-to-Real regression Similarly, one may consider a mapping that takes in multiple functions: 39 39

40 Sparse Functions-to-Real regression When the number of functional input covariates may be very large, a sparse model that depends only on a few of the functional covariates may be preferred: Goal: Finds a sparse set of functional input covariates to predict a real-valued response

41 FuSSO Example Applications Finance: Inputs: Time-series of several product prices in the past Output: Price of a particular product in the nearby future A s Prices J s Prices Future Price 41 K s Prices 41

42 FuSSO Applications in Neuroimaging Inputs: Functions at each voxel (e.g. orientation distribution functions) Output: The age of the subject 42 Voxels ODFs Image credit: Age 42

Results: Neuroimaging dataset Dataset with over 25K functions per subject for 89 total subjects (18 to 60 years old) Orientation distribution functions (ODF) at white matter voxels Goal: Predict the

43 Results: Neuroimaging dataset Dataset with over 25K functions per subject for 89 total subjects (18 to 60 years old) Orientation distribution functions (ODF) at white matter voxels Goal: Predict the subject's age, given ODFs We compared to LASSO with peak ODF (quantitative anisotropy, QA) values. Finite dim non-functional data set. Ages Example Voxel ODF Image Sources:

43 Absolute Errors per Subject Selected Voxels 44

44 Results: Neuroimaging dataset Results: Method: FuSSO (ODFs) LASSO (QAs) Mean Predict MSE: Absolute Errors per Subject Selected Voxels 44 Mean error: 8.3 years, Naïve approach error: 12.5 years 44

45 45 Agriculture

46 Agriculture Recommend experiments (which plants to cross) to sorghum breeders. 46

47 CMU Robot 47 1 PB data

48 48

Name Range RMSE error Leaf angle* 75.94 3.30 (4.

60%) Leaf length* 35.00 0.87 (2.49%) Leaf width [max] 3.

49 Name Range RMSE error Leaf angle* (4.35%) Leaf radiation angle* (3.60%) Leaf length* (2.49%) Leaf width [max] (7.48%) Leaf width [average] (7.o2%) 49 Leaf area* (6.08%)

50 50 Grapes datasets

51 Take me home! ML on Complex Objects o ML on distributions o Lasso on functions Active learning and design optimization Applications: Cosmology Drug Design Agriculture Neuroscience 51

52 Thanks for your attention! If interested, please contact me! GHC

53 Linear Functional Regression One Real Vector vs. Functional Covariate: Real Vector Covariate Y i = hx i ; wi + ² i Functional Covariate Y i = hf (i) ; gi + ² i X i ; w 2 R d and hx i ; wi = P d j=1 X ijw j f (i) ; g 2 L 2 (ª) and hf (i) ; gi = R ª f (i) (t)g(t)dt 53 53

Physics Lab #10: Citizen Science - The Galaxy Zoo

Physics Lab #10: Citizen Science - The Galaxy Zoo Physics 10263 Lab #10: Citizen Science - The Galaxy Zoo Introduction Astronomy over the last two decades has been dominated by large sky survey projects. The Sloan Digital Sky Survey was one of the first