2017 Predictive Analytics Symposium


1 2017 Predictive Analytics Symposium Session 14, Introduction to Machine Learning Moderator: Robert Anders Larson, FSA, MAAA Presenter: Boyi Xie, Ph.D. SOA Antitrust Compliance Guidelines SOA Presentation Disclaimer

2 Introduction to Machine Learning Boyi Xie, SOA Predictive Analytics Symposium, 09/14/2017

3 Outline: What is Machine Learning; Empirical Risk Minimization; Cross Validation; Supervised Learning; Unsupervised Learning; Ranking; Bayesian Models; Applications

4 What is Machine Learning? An interdisciplinary field drawing on computer science, electrical engineering, math, statistics, physics, operations research, psychology, etc. It is a branch of artificial intelligence that focuses on the theory of learning algorithms and their application to problem solving.

5 Empirical Risk Minimization
Idea: minimize loss on the training data set. Empirical: use the training set to find the best fit.
Define a loss function $L(y, f(x))$ that measures how well we fit a single point.
The empirical risk is the average loss over the dataset: $R = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$
Squared error: $L(y_i, f(x_i)) = \frac{1}{2}(y_i - f(x_i))^2$; absolute error: $L(y_i, f(x_i)) = |y_i - f(x_i)|$
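As an illustration (not code from the presentation), here is a minimal NumPy sketch of the empirical risk under the two losses above; the data and the linear candidate model are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=100)   # noisy, made-up linear data

def f(x, w=2.0, b=0.5):
    """Candidate model f(x) whose fit we want to score."""
    return w * x + b

pred = f(x)
squared_risk = np.mean(0.5 * (y - pred) ** 2)   # R = (1/N) * sum of squared-error losses
absolute_risk = np.mean(np.abs(y - pred))       # R = (1/N) * sum of absolute-error losses
print(squared_risk, absolute_risk)
```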

6 Empirical Risk Minimization. [Figure: fitting a polynomial model.]

7 Regularized Risk Minimization. We want to add a penalty on the complexity of the model, such as the size of its parameters. This gives us the regularized risk.

8 Evaluating Our Learned Function
We minimized empirical risk to get $\hat{\theta}$. How well does $f(x; \hat{\theta})$ perform on future data? It should generalize and have low true risk:
$R_{true}(\theta) = \int P(x, y)\, L(y, f(x; \theta))\, dx\, dy$
We can't compute the true risk, so instead we use the testing empirical risk. We randomly split the data into training and testing portions:
$\{(x_1, y_1), \dots, (x_N, y_N)\}$ and $\{(x_{N+1}, y_{N+1}), \dots, (x_{N+M}, y_{N+M})\}$
Find $\hat{\theta}$ with the training data: $R_{train}(\hat{\theta}) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i; \hat{\theta}))$
Evaluate it with the testing data: $R_{test}(\hat{\theta}) = \frac{1}{M}\sum_{i=N+1}^{N+M} L(y_i, f(x_i; \hat{\theta}))$
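A minimal sketch of this train/test protocol: fit $\hat{\theta}$ on the training portion only, then report the average loss on the held-out portion. The synthetic data and the least-squares fit are illustrative assumptions, not from the presentation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=300)

idx = rng.permutation(len(y))
train, test = idx[:200], idx[200:]                            # random split: N train, M test

theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)   # fit on training data only

def risk(X, y, theta):
    return np.mean(0.5 * (y - X @ theta) ** 2)                # average squared-error loss

print("train risk:", risk(X[train], y[train], theta))
print("test  risk:", risk(X[test], y[test], theta))
```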

9 Regularized Empirical Risk Minimization
Idea: minimize loss on the training data set plus a penalty on model complexity:
$R_{regularized}(\theta) = R_{empirical}(\theta) + Penalty(\theta) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i; \theta)) + \frac{\lambda}{2N}\lVert\theta\rVert^2$
Select the $\lambda$ that gives the lowest cost; $\lambda$ controls how strongly model simplicity is favored.
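A minimal sketch of this idea for squared-error loss (ridge regression), trying a grid of $\lambda$ values and keeping the one with the lowest risk on a held-out validation split. The data, the split, and the $\lambda$ grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, y_tr = X[:150], y[:150]
X_va, y_va = X[150:], y[150:]

def fit_ridge(X, y, lam):
    # Minimizer of (1/N) * sum of squared losses + (lambda / 2N) ||theta||^2,
    # which has the closed form theta = (X^T X + lambda I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

best = None
for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]:
    theta = fit_ridge(X_tr, y_tr, lam)
    val_risk = np.mean(0.5 * (y_va - X_va @ theta) ** 2)      # risk on held-out data
    if best is None or val_risk < best[0]:
        best = (val_risk, lam)
print("selected lambda:", best[1], "validation risk:", best[0])
```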

10 Frequentists & Bayesians. Frequentists (Neyman, Pearson, Wald): the classical, objective view with no priors. Data are a repeatable random sample, so there is a frequency; the underlying parameters remain constant during this repeatable process. Frequentist inference estimates one best set of model parameters, often using the maximum-likelihood estimator (unbiased and minimum variance). Bayesians (Bayes, Laplace, de Finetti): unknown quantities are treated probabilistically, and the state of the world can always be updated. Data are observed from the realized sample; parameters are unknown and described probabilistically. Put a distribution (pdf) on all variables in the problem.

11 Models of Interest Over Time. [Timeline figure] A view of the development of machine learning, showing, among others: ANOVA, Fisher's linear discriminant, the naive Bayes classifier, the perceptron, the nearest neighbor algorithm, single-linkage clustering, the k-means algorithm, logistic regression and multinomial logistic regression, quadratic classifiers, CART, CHAID, C4.5, the bootstrap (bagging), boosting, backpropagation, the Boltzmann machine, self-organizing maps, the expectation-maximization algorithm, hidden Markov models, Bayesian networks, the Apriori algorithm, support vector machines, conditional random fields, random forests, gene expression programming, non-parametric Bayesian methods, latent Dirichlet allocation, MapReduce, word embeddings, and deep learning.

12 Supervised Learning. Labels are given for the data points. The goal is to maximize the log-likelihood of the data given the model, i.e. to develop a general rule based on inductive inference. Example models: perceptron, logistic regression, support vector machines, k-NN, decision trees, neural networks / deep learning.

13 Perceptron
Linear discriminant functions and decision hyperplanes: $g(x) = w^T x + w_0 = 0$
Assume the two classes $\{-1\}$ and $\{+1\}$ are linearly separable, i.e. there exists a hyperplane, defined by $w^T x = 0$, such that
$w^T x > 0 \;\forall x \in \{+1\}$ and $w^T x < 0 \;\forall x \in \{-1\}$
We approach the problem as an optimization task. Choose a cost function, the perceptron cost $J(w) = \sum_{x \in Y} \delta_x\, w^T x$, where $Y$ is the set of misclassified training samples and $\delta_x = -1$ if $x$ belongs to class $+1$, $\delta_x = +1$ if $x$ belongs to class $-1$.
Use gradient descent to iteratively minimize the cost function: $w(t+1) = w(t) - \rho_t \frac{\partial J(w)}{\partial w}$

14 Perceptron
Use gradient descent to find $w$ by iteratively minimizing the cost function:
$w(t+1) = w(t) - \rho_t \frac{\partial J(w)}{\partial w} = w(t) - \rho_t \sum_{x \in Y} \delta_x x$
Algorithm:
Initialize $w(0)$, choose a learning rate $\rho_0$, and set $t = 0$
Repeat
  Let $Y = \emptyset$
  For $i = 1$ to $N$: if $\delta_{x_i} w(t)^T x_i \geq 0$ then $Y = Y \cup \{x_i\}$
  Update $w$: $w(t+1) = w(t) - \rho_t \sum_{x \in Y} \delta_x x$
  Adjust $\rho_t$
  $t = t + 1$
Until $Y = \emptyset$
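A minimal NumPy sketch of this loop (not code from the presentation), using the fact that $\delta_x = -y$ for labels $y \in \{-1, +1\}$, so the update adds $\rho_t \sum_{x \in Y} y\, x$. The data, the fixed learning rate, and the iteration cap are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
X = np.hstack([X, np.ones((100, 1))])            # append a 1 so w_0 is absorbed into w
y = np.array([-1] * 50 + [+1] * 50)              # labels in {-1, +1}

w = np.zeros(3)
rho = 0.1                                        # learning rate rho_t, held constant here
for t in range(100):
    scores = X @ w
    misclassified = y * scores <= 0              # the set Y of wrongly classified samples
    if not misclassified.any():
        break                                    # stop when Y is empty
    # Gradient step on J(w) = sum_{x in Y} delta_x w^T x with delta_x = -y:
    w = w + rho * (y[misclassified][:, None] * X[misclassified]).sum(axis=0)
print("final weights:", w)
```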

15 Logistic Regression
In logistic regression, the logarithm of the likelihood ratios is modeled via linear functions:
$\ln \frac{P(\omega_i \mid x)}{P(\omega_M \mid x)} = w_i^T x$, for $i = 1, 2, \dots, M-1$, where $M$ is the number of classes.
We also need to ensure $\sum_{i=1}^{M} P(\omega_i \mid x) = 1$.
Combining the above two equations, we have
$P(\omega_M \mid x) = \frac{1}{1 + \sum_{j=1}^{M-1} \exp(w_j^T x)}$
and the standard logistic form
$P(\omega_i \mid x) = \frac{\exp(w_i^T x)}{1 + \sum_{j=1}^{M-1} \exp(w_j^T x)}$, for $i = 1, 2, \dots, M-1$.
Also called multinomial logistic regression, or the maximum entropy model.
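A minimal sketch of these class probabilities in NumPy, with class M as the reference class whose weight vector is fixed at zero; the weights and the input are illustrative assumptions.

```python
import numpy as np

M, d = 4, 3
rng = np.random.default_rng(4)
W = rng.normal(size=(M - 1, d))                  # w_1 ... w_{M-1}; w_M is implicitly 0
x = rng.normal(size=d)

scores = W @ x                                   # w_i^T x for i = 1..M-1
denom = 1.0 + np.sum(np.exp(scores))
p = np.append(np.exp(scores), 1.0) / denom       # P(w_i|x) for i = 1..M-1, then P(w_M|x)
print(p, p.sum())                                # the probabilities sum to 1
```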

16 Support Vector Machines. Search for the hyperplane that gives the maximum possible margin. [Figures] A linear kernel on separable data; a Gaussian kernel on non-separable data using a soft-margin SVM.
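A minimal sketch of the two cases above using scikit-learn (an assumption; the presentation does not name a library): a linear kernel and a soft-margin SVM with an RBF (Gaussian) kernel. The data are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=-2, size=(50, 2)), rng.normal(loc=+2, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)            # maximum-margin hyperplane
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # soft margin, Gaussian kernel
print("linear accuracy:", linear_svm.score(X, y))
print("rbf    accuracy:", rbf_svm.score(X, y))
```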

17 Nearest Neighbors. A non-parametric method used for classification and regression, where the decision is based on the k closest training examples in the feature space. [Figure] Example of k-NN classification with k = 3 (solid-line circle); the majority vote is the predicted class.
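A minimal sketch of k-NN classification by majority vote, written directly in NumPy; k = 3 mirrors the figure, but the data and query point are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
X_train = np.vstack([rng.normal(loc=-1, size=(30, 2)), rng.normal(loc=+1, size=(30, 2))])
y_train = np.array([0] * 30 + [1] * 30)

def knn_predict(x, X_train, y_train, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)        # distances in feature space
    nearest = np.argsort(dists)[:k]                    # indices of the k closest examples
    votes = np.bincount(y_train[nearest])              # majority vote over their labels
    return int(np.argmax(votes))

print(knn_predict(np.array([0.2, 0.1]), X_train, y_train, k=3))
```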

18 Decision Tree. A class of nonlinear classifiers in which the feature space is split into unique regions, corresponding to the classes, in a sequential manner. Key elements in designing a decision tree algorithm: at each node, a set of candidate questions (features) to consider for splitting into descendant nodes; a splitting criterion, e.g. information gain or the Gini index; a stop-splitting rule, e.g. a minimum number of instances in a leaf; and a rule to assign each leaf to a specific class. Example decision tree algorithms: CART (Classification And Regression Trees), C4.5, CHAID.
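A minimal sketch of the two splitting criteria mentioned above, evaluated for one candidate binary split; the label arrays are illustrative assumptions.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)                          # Gini impurity of a node

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    n = len(parent)                                      # entropy drop from the split
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:5], parent[5:]                     # a candidate split at one node
print("Gini(parent):", gini(parent), "information gain:", information_gain(parent, left, right))
```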

19 Ensemble Methods: Bagging and Boosting. Can a weak learning algorithm be boosted into a strong one? Choose a base classifier, i.e. a weak classifier. A series of classifiers is then designed iteratively, each time employing the base classifier but using a different subset of the training set, according to a different weighting over the training samples that gives emphasis to the hardest (incorrectly classified) samples. The final classifier is obtained as a weighted average of the previously designed classifiers. Popular models: Random Forest, Gradient Boosting Machine.
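A minimal sketch of the two ensemble families named above, using scikit-learn implementations (an assumption; any gradient boosting library would do). The nonlinear target and the split are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)             # a nonlinear, made-up target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)       # bagging
gbm = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)  # boosting
print("random forest test accuracy:", rf.score(X_te, y_te))
print("gradient boosting test accuracy:", gbm.score(X_te, y_te))
```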

20 Neural Network and Deep Learning. A family of statistical learning models inspired by biological neural networks, characterized by: the interconnection pattern between the different layers of neurons; the learning process for updating the weights of the interconnections; and the activation function that converts a neuron's weighted input to its output. [Figures] A (shallow) neural network and a deep network.

21 Supervised Learning vs. Unsupervised Learning
Recall that in a classification problem we maximize the log-likelihood of the data given the model:
$\ell = \sum_{n=1}^{N} \log p(x_n, y_n \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \pi_{y_n} \mathcal{N}(x_n \mid \mu_{y_n}, \Sigma_{y_n})$
If we don't know the class, we treat it as a hidden variable and maximize the log-likelihood of the unlabeled data:
$\ell = \sum_{n=1}^{N} \log p(x_n \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{y=1}^{K} p(x_n, y \mid \pi, \mu, \Sigma)$
Instead of a classification problem, we now have a clustering problem.

22 K-Means Clustering
K-Means solves a chicken-and-egg problem: if we knew the classes, we could fit the model (e.g. by maximum likelihood); if we knew the model, we could predict the classes. K-Means: guess a model, use it to classify the data, use the classified data as labeled data to update the model, and repeat, so as to minimize the cost function
$\min_{\mu} \min_{z} J(\mu_1, \dots, \mu_K, z_1, \dots, z_N) = \sum_{n=1}^{N} \sum_{i=1}^{K} z_n^{(i)} \lVert x_n - \mu_i \rVert^2$
1. Input the dataset $\{x_1, \dots, x_N\}$
2. Randomly initialize the means $\mu_1, \dots, \mu_K$
3. Find the closest mean for each point: $z_n^{(i)} = 1$ if $i = \arg\min_j \lVert x_n - \mu_j \rVert^2$, and $0$ otherwise
4. Update the means: $\mu_i = \frac{\sum_{n=1}^{N} x_n z_n^{(i)}}{\sum_{n=1}^{N} z_n^{(i)}}$
5. If any $z$ has changed, go to 3
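A minimal NumPy sketch of this loop: assign each point to its closest mean, recompute the means, and repeat until the assignments stop changing. The data and K = 3 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (-4, 0, 4)])
K = 3

mu = X[rng.choice(len(X), size=K, replace=False)]      # step 2: random initial means
z = np.full(len(X), -1)
while True:
    dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    z_new = dists.argmin(axis=1)                       # step 3: closest mean per point
    if np.array_equal(z_new, z):
        break                                          # step 5: stop when nothing changed
    z = z_new
    for i in range(K):
        if np.any(z == i):                             # step 4: update non-empty clusters
            mu[i] = X[z == i].mean(axis=0)
print("final means:\n", mu)
```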

23 Expectation-Maximization (EM)
EM is a soft version of K-Means. In K-Means the assignment is hard:
$z_n^{(i)} = 1$ if $i = \arg\min_j \lVert x_n - \mu_j \rVert^2 = \arg\max_j \mathcal{N}(x_n \mid \mu_j, I) = \arg\max_j p(x_n \mid \mu_j)$, and $0$ otherwise.
Instead, consider a soft, percentage assignment of data points.
Expectation (soft class assignment): $\tau_{n,i} = \frac{\pi_i^{(t)} \mathcal{N}(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}{\sum_j \pi_j^{(t)} \mathcal{N}(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$
Maximization:
mean: $\mu_i^{(t+1)} = \frac{\sum_n \tau_{n,i}^{(t)} x_n}{\sum_n \tau_{n,i}^{(t)}}$
mixing proportions: $\pi_i^{(t+1)} = \frac{\sum_n \tau_{n,i}^{(t)}}{N}$
covariance: $\Sigma_i^{(t+1)} = \frac{\sum_n \tau_{n,i}^{(t)} (x_n - \mu_i^{(t+1)})(x_n - \mu_i^{(t+1)})^T}{\sum_n \tau_{n,i}^{(t)}}$
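A minimal sketch of these E and M updates for a Gaussian mixture, written in NumPy and SciPy; the data, K = 2, and the fixed number of iterations are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(loc=-3, size=(100, 2)), rng.normal(loc=+3, size=(100, 2))])
N, K = len(X), 2

pi = np.full(K, 1.0 / K)                                 # mixing proportions
mu = X[rng.choice(N, size=K, replace=False)]             # initial means
cov = np.array([np.eye(2) for _ in range(K)])            # initial covariances

for _ in range(50):
    # E-step: tau[n, i] proportional to pi_i * N(x_n | mu_i, Sigma_i), normalized over i
    tau = np.column_stack([pi[i] * multivariate_normal.pdf(X, mu[i], cov[i]) for i in range(K)])
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing proportions, means, and covariances
    Nk = tau.sum(axis=0)
    pi = Nk / N
    mu = (tau.T @ X) / Nk[:, None]
    for i in range(K):
        diff = X - mu[i]
        cov[i] = (tau[:, i, None] * diff).T @ diff / Nk[i]
print("means:\n", mu, "\nmixing proportions:", pi)
```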

24 Hierarchical Agglomerative Clustering. A bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The two key choices are the distance metric and the linkage criterion. Image: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press.
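A minimal sketch of bottom-up agglomerative clustering with SciPy (an assumption; the presentation does not name a library), making the two choices called out above explicit: a Euclidean distance metric and average linkage. The data are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(loc=-3, size=(20, 2)), rng.normal(loc=+3, size=(20, 2))])

Z = linkage(X, method="average", metric="euclidean")   # iteratively merge the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")        # cut the hierarchy into 2 flat clusters
print(labels)
```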

25 Ranking
Goal: rank positive instances to the top and negative instances to the bottom.
An example application: rank instances according to their likelihood of being involved in a serious event, so that the top of the ranked list contains serious instances and the bottom contains non-serious instances. Serious events are rare.
More formally, the goal is to construct a ranking function $f$ that gives a real-valued score to each instance in $X$, that is, $f: X \to \mathbb{R}$. Note: because serious events are rare, predicting the actual probability of a rare event may not be feasible (or accurate). Thus we do not care about the actual score values, only the relative values between instances.
We can formalize a general form of the objective in terms of: x, a data instance; I, the instances of serious events; K, the instances of non-serious events; f, a function mapping an instance to a seriousness score; l, a loss function penalizing a small score for a serious event; and g, a price function penalizing a highly ranked non-serious event.
[Figure] A ranked list of instances, ordered from more serious (high $f(x_i)$) to less serious (low $f(x_k)$).

26 Bayesian Models: an example in topic modelling. Distinguish or cluster words by semantics. Consider the word groups {auto, engine, bonnet, tyres, lorry, boot} and {car, emissions, hood, make, model, trunk}: documents using these two vocabularies share few words (small cosine similarity), yet they are related; this is synonymy. Conversely, words such as "make", "model", and "emissions" also appear in documents about hidden Markov models, giving a large cosine similarity to documents that are not truly related; this is polysemy. The goal is to uncover the latent relation between documents and words.

27 Probabilistic Models: Bag of Words & Naive Bayes. Models related to the aspect model (topic model): a document is a mixture of underlying (latent) K aspects (topics), and each aspect is represented by a distribution over words, p(w|z). The family includes the unigram model, the mixture of unigrams model, probabilistic latent semantic indexing, and latent Dirichlet allocation. Mixture of unigrams model (naive Bayes): for each of the M documents, choose a topic z, then choose N words by drawing each one independently from a multinomial conditioned on z. There is one topic per document.

28 Probabilistic Models: PLSI. In the same family of aspect models as above. Hofmann '99 (probabilistic latent semantic indexing): for each word of document d in the training set, choose a topic z according to a multinomial conditioned on the index d, then generate the word by drawing from a multinomial conditioned on z. Documents can have multiple topics. Number of parameters: kM + kV.

29 Probabilistic Models: LDA. In the same family of aspect models as above. Blei '03 (latent Dirichlet allocation): for each document, choose θ ~ Dirichlet(α); then, for each word, choose a topic z_n ~ Multinomial(θ) and choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n. The topic mixture weights θ are a hidden random variable. Number of parameters: k + kV.
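A minimal sketch of fitting an LDA topic model with scikit-learn (an assumption; the cited Blei et al. paper describes its own variational inference, not this library). The tiny corpus is made up purely for illustration and echoes the earlier car/HMM word example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "auto engine tyres lorry",
    "car emissions hood trunk model",
    "hidden markov model emissions probability",
    "topic model dirichlet allocation words",
]
counts = CountVectorizer().fit(docs)
X = counts.transform(docs)                             # document-word count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):            # each row is a word distribution p(w|z)
    top = topic.argsort()[::-1][:5]
    print("topic", k, ":", [vocab[i] for i in top])
```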

30 Machine Learning Applications. Speech recognition (HMM, neural nets / deep learning, ...); computer vision (neural nets / deep learning, SVM, ...); time series prediction (HMM, Gaussian processes, Bayesian methods, ...); genomics (HMM, SVM, ...); natural language processing (HMM, CRF, Bayesian methods, deep learning, ...); information retrieval (entropy, SVM, clustering, ...); medical (decision trees, HMM, Bayesian methods, ...); behavior / games (reinforcement learning, Bayesian methods, deep learning, ...).

31 References
Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
Tony Jebara. Machine Learning course materials, Department of Computer Science, Columbia University.
Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition, Fourth Edition. Academic Press.
Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st ed.). Prentice Hall PTR, Upper Saddle River, NJ, USA.
Jiawei Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Cynthia Rudin. "The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List." Journal of Machine Learning Research 10 (2009).
Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI '99), Kathryn B. Laskey and Henri Prade (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (Jan 2003): 993-1022.
Wikipedia pages and images on machine learning topics.
Other papers published in academic journals and conferences.


33 Legal notice 2017 Swiss Re. All rights reserved. You are not permitted to create any modifications or derivative works of this presentation or to use it for commercial or other public purposes without the prior written permission of Swiss Re. The information and opinions contained in the presentation are provided as at the date of the presentation and are subject to change without notice. Although the information used was taken from reliable sources, Swiss Re does not accept any responsibility for the accuracy or comprehensiveness of the details given. All liability for the accuracy and completeness thereof or for any damage or loss resulting from the use of the information contained in this presentation is expressly excluded. Under no circumstances shall Swiss Re or its Group companies be liable for any financial or consequential loss relating to this presentation. 32
