Harrison B. Prosper. Bari Lectures

Lectures on Multivariate Methods
Harrison B. Prosper, Florida State University
Bari Lectures, 30, 31 May and 1 June 2016

- Lecture 1: Introduction, Classification, Grid Searches, Decision Trees
- Lecture 2: Boosted Decision Trees
- Lecture 3: Neural Networks, Bayesian Neural Networks

https://support.ba.infn.it/redmine/projects/statistics/wiki/mva

Today, we shall apply neural networks to two problems:
- Wine tasting
- Approximating a 19-parameter function

Recall that our goal is to classify wines as good or bad using the variables highlighted in blue below:

variable   description
acetic     acetic acid
citric     citric acid
sugar      residual sugar
salt       NaCl
SO2free    free sulfur dioxide
SO2tota    total sulfur dioxide
ph         pH
sulfate    potassium sulfate
alcohol    alcohol content
quality    (between 0 and 1)

http://www.vinhoverde.pt/en/history-of-vinho-verde

(One version of Hilbert's Problem 13): Prove that it is possible to do the following:

f(x_1, \ldots, x_n) = F\big( g_1(x^{(1)}, \ldots, x^{(m)}), \ldots, g_k(x^{(1)}, \ldots, x^{(m)}) \big), \quad m < n, \text{ for all } n

In 1957, Kolmogorov proved that it was possible with m = 3. Today, we know that functions of the form

f(x_1, \ldots, x_I) = a + \sum_{j=1}^{H} b_j \tanh\!\left( c_j + \sum_{i=1}^{I} d_{ji} x_i \right)

can provide arbitrarily accurate approximations of real functions of I real variables. (Hornik, Stinchcombe, and White, Neural Networks 2, 359-366 (1989))

[Diagram: a neural network with inputs x_1, x_2, hidden-node parameters c_j and d_{ji}, output parameters a and b_j, and output n(x, w).]

f(x, w) = a + \sum_{j=1}^{H} b_j \tanh\!\left( c_j + \sum_{i=1}^{I} d_{ji} x_i \right)

f is used for regression; n is used for classification. w = (a, b, c, d) are the free parameters to be found by the training.
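As a concrete illustration, here is a minimal Python sketch of this function. It is not code from the lectures; the array shapes and the choice of a logistic sigmoid to map f into [0, 1] for n are assumptions.

```python
import numpy as np

def f_regression(x, a, b, c, d):
    """f(x, w) = a + sum_j b_j * tanh(c_j + sum_i d_ji * x_i).

    x: input point of shape (I,); a: scalar; b, c: shape (H,); d: shape (H, I).
    """
    return a + b @ np.tanh(c + d @ x)

def n_classification(x, a, b, c, d):
    """n(x, w): f squashed through a logistic sigmoid so the output lies in [0, 1].
    (The sigmoid is an assumption; the slides only say that n is used for classification.)"""
    return 1.0 / (1.0 + np.exp(-f_regression(x, a, b, c, d)))
```

With, say, H = 10 hidden nodes and I = 2 inputs, w = (a, b, c, d) comprises 1 + 10 + 10 + 20 = 41 free parameters.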

We consider the same two variables, SO2tota and alcohol, and the NN function shown here. We use data for 500 good wines and 500 bad wines.

To construct the NN, we minimize the empirical risk function

R(w) = \frac{1}{2} \sum_{i=1}^{1000} \big[ y_i - n(x_i, w) \big]^2

where y = 1 for good wines and y = 0 for bad wines, which yields an approximation to

D(x) = \frac{p(x \mid \text{good})}{p(x \mid \text{good}) + p(x \mid \text{bad})}, \qquad x = (\text{SO2tota, alcohol})
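A minimal sketch of this minimization in Python, assuming the network form above with the parameters packed into a single flat vector w, a logistic sigmoid for n, and hypothetical arrays X (1000 x 2) and y holding the wines and their labels; it is not the code used in the lectures.

```python
import numpy as np
from scipy.optimize import minimize

H, I = 10, 2                                  # hidden nodes, input variables

def unpack(w):
    """Split the flat parameter vector w into (a, b, c, d)."""
    a = w[0]
    b = w[1:1 + H]
    c = w[1 + H:1 + 2 * H]
    d = w[1 + 2 * H:].reshape(H, I)
    return a, b, c, d

def n_of_x(X, w):
    """n(x, w) for a batch X of shape (N, I): logistic sigmoid of the tanh network."""
    a, b, c, d = unpack(w)
    f = a + np.tanh(X @ d.T + c) @ b
    return 1.0 / (1.0 + np.exp(-f))

def risk(w, X, y):
    """Empirical risk R(w) = 1/2 * sum_i [y_i - n(x_i, w)]^2."""
    return 0.5 * np.sum((y - n_of_x(X, w)) ** 2)

# X: (1000, 2) array of (SO2tota, alcohol); y: 1 for good wines, 0 for bad (not shown here).
# w0 = np.random.normal(scale=0.1, size=1 + 2 * H + H * I)
# w_hat = minimize(risk, w0, args=(X, y), method="BFGS").x
```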

The distribution of D(x) and the Receiver Operating Characteristic (ROC) curve. In this example, the ROC curve shows the fraction f of each class of wine accepted.
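A ROC curve of this kind can be traced by scanning a cut on D(x) and recording the accepted fraction of each class. A short sketch, assuming hypothetical arrays D_good and D_bad holding the network output for the two samples:

```python
import numpy as np

def roc_points(D_good, D_bad, n_thresholds=100):
    """Fractions of good and bad wines accepted as the cut D(x) > t is varied."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    eff_good = np.array([(D_good > t).mean() for t in thresholds])
    eff_bad = np.array([(D_bad > t).mean() for t in thresholds])
    return eff_bad, eff_good
```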

[Figure: the actual p(good | x) compared with the best-fit NN.]


Given the posterior density p(w | T) and some new data x, one can compute the predictive distribution of y. If a definite estimate of y is needed for every x, this can be obtained by minimizing a suitable risk functional,

R[f] = \int L(y, f) \, p(y \mid x, T) \, dy

Using, for example, L = (y - f)^2 yields the estimate

f(x) = \bar{y}(x, T) = \int y \, p(y \mid x, T) \, dy

In general, a different L will yield a different estimate of f(x).
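For completeness, here is the one-line derivation, not spelled out on the slide, of why the squared loss picks out the conditional mean:

```latex
0 = \frac{\partial R}{\partial f}
  = \frac{\partial}{\partial f} \int (y - f)^2 \, p(y \mid x, T) \, dy
  = -2 \int (y - f) \, p(y \mid x, T) \, dy
\;\Longrightarrow\;
f(x) = \int y \, p(y \mid x, T) \, dy ,
```

since the density integrates to 1.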

For function approximation, the likelihood p(y | x, w) for a given (y, x) is typically chosen to be Gaussian,

p(y \mid x, w) = \exp\!\left( -\tfrac{1}{2} \big[ y - f(x, w) \big]^2 / \sigma^2 \right)

For classification one uses

p(y \mid x, w) = [n(x, w)]^{y} \, [1 - n(x, w)]^{1 - y}, \qquad y = 0 \text{ or } 1

Setting y = 1 in the predictive distribution gives

p(1 \mid x, T) = \int n(x, w) \, p(w \mid T) \, dw

which is an estimate of the probability that x belongs to the class for which y = 1.
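In practice the integral over w is not done analytically; assuming, as is usual, that it is estimated by Monte Carlo with a sample of networks drawn from the posterior (see the MCMC slide below), the estimate is

```latex
p(1 \mid x, T) = \int n(x, w) \, p(w \mid T) \, dw
\;\approx\; \frac{1}{K} \sum_{k=1}^{K} n(x, w_k),
\qquad w_k \sim p(w \mid T).
```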

[Figure: p(s | p_T4l) as a function of p_T4l (GeV).]

To construct the BNN, we sample points w from the posterior density

p(w \mid T) = \frac{p(T \mid w) \, p(w)}{p(T)}

using a Markov chain Monte Carlo method. This generates an ensemble of neural networks, just as in the 1-D example on the previous slide.
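The sampling step might look like the random-walk Metropolis sketch below. This is only an illustration under stated assumptions (a Bernoulli likelihood with a Gaussian prior on w, and the n_of_x helper from the earlier sketch); it is not the sampler used in the lectures, and BNN practice commonly uses more efficient methods such as Hamiltonian (hybrid) Monte Carlo.

```python
import numpy as np

def log_posterior(w, X, y, sigma_prior=1.0):
    """log p(w | T) up to a constant: Bernoulli likelihood times a Gaussian prior on w."""
    p = np.clip(n_of_x(X, w), 1e-10, 1 - 1e-10)      # n(x, w) from the earlier sketch
    log_like = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -0.5 * np.sum(w ** 2) / sigma_prior ** 2
    return log_like + log_prior

def metropolis(w0, X, y, n_steps=10000, step=0.02, seed=0):
    """Random-walk Metropolis: returns an ensemble of weight vectors drawn from p(w | T)."""
    rng = np.random.default_rng(seed)
    w, logp = w0.copy(), log_posterior(w0, X, y)
    samples = []
    for _ in range(n_steps):
        w_new = w + step * rng.normal(size=w.size)
        logp_new = log_posterior(w_new, X, y)
        if np.log(rng.uniform()) < logp_new - logp:  # accept with probability min(1, ratio)
            w, logp = w_new, logp_new
        samples.append(w.copy())
    return np.array(samples)

# Predictive probability for a new point x_new: average n over the ensemble
# (after discarding burn-in and thinning), approximating the integral over w:
# p_good = np.mean([n_of_x(x_new[None, :], w_k) for w_k in samples[5000::50]])
```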

The distribution of D(x) (using the average of 100 NNs) and the ensemble of ROC curves, one for each NN.

[Figure: the actual p(good | x) compared with the ensemble-averaged network <NN>.]

We see that if good and bad wines are equally likely, and we are prepared to accept 40% of bad wines while accepting 80% of good ones, then the BDT does slightly better. But if we lower the acceptance rate of bad wines, all three methods work equally well.

Now that the Higgs boson has been found, the primary goal of researchers at the LHC is to look for and, we hope, discover new physics. There are two basic strategies:
1. Look for any deviation from the predictions of our current theory of particles (the Standard Model).
2. Look for the specific deviations predicted by theories of new physics, such as those based on a proposed symmetry of Nature called supersymmetry.

While finishing a paper related to his PhD dissertation, my student Sam Bein had to approximate a function of 19 parameters!

This is what I suggested he try.
1. Write the function to be approximated as

   f(\theta) = C(\theta) \prod_{i=1}^{19} g_i(\theta_i)

2. Use NNs to approximate each g_i(\theta_i), by setting y(\theta_i) = g_i.
3. Use another neural network to approximate the ratio

   D(\theta) = \frac{p(\theta)}{p(\theta) + \prod_{i=1}^{19} g_i(\theta_i)}

   where the background is sampled from the g functions and the signal is the original distribution of points.

4. Finally, approximate the function f using

   f(\theta) = \frac{D(\theta)}{1 - D(\theta)} \prod_{i=1}^{19} g_i(\theta_i)

The intuition here is that much of the dependence of f on θ is captured by the product function, leaving (we hoped!) a somewhat gentler 19-dimensional function to be approximated. The next slide shows the neural network approximations of a few of the g functions. Of course, 1-D functions can be approximated using a wide variety of methods. The point here is to show the flexibility of a single class of NNs.
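To spell out the algebra behind step 4 (implicit in the slides): if the network in step 3 learns the ratio exactly, and we identify the distribution p(θ) of the original points with the function being approximated, up to normalization (my reading of the slides), then

```latex
D(\theta) = \frac{p(\theta)}{p(\theta) + \prod_{i=1}^{19} g_i(\theta_i)}
\;\Longrightarrow\;
\frac{D(\theta)}{1 - D(\theta)} = \frac{p(\theta)}{\prod_{i=1}^{19} g_i(\theta_i)}
\;\Longrightarrow\;
p(\theta) = \frac{D(\theta)}{1 - D(\theta)} \prod_{i=1}^{19} g_i(\theta_i),
```

so multiplying the learned density ratio by the product of the 1-D approximations recovers the target.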

[Figure: NN approximations of the 1-D g functions for four of the 19 parameters: Ab, At, M1, and M2.]


- Multivariate methods can be applied to many aspects of data analysis, including classification and function approximation. Even though there are hundreds of methods, almost all are numerical approximations of the same mathematical quantity.
- We considered random grid searches, decision trees, boosted decision trees, neural networks, and the Bayesian variety.
- Since no method is best for all problems, it is good practice to try a few of them, say BDTs and NNs, and check that they give approximately the same results.