Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning
Part 1 (Stefan Wermter): Introduction; Connectionist Learning (e.g. Neural Networks); Decision Trees; Genetic Algorithms
Part 2 (Norman Hendrich): Support Vector Machines; Learning of Symbolic Structures; Bayesian Learning; Dimensionality Reduction
Part 3 (Jianwei Zhang): Function Approximation; Reinforcement Learning; Applications in Robotics

Bayesian Learning: Bayesian Reasoning, Bayes Optimal Classifier, Naïve Bayes Classifier, Cost-Sensitive Decisions, Modelling with Probability Density Functions, Parameter Estimation, Bayesian Networks, Markov Models, Dynamic Bayesian Networks, Conditional Random Fields

Bayesian Reasoning
- derive the probability of a hypothesis h about some observation x
- a priori probability P(h): probability of the hypothesis prior to the observation
- a posteriori probability P(h|x): probability of the hypothesis after the observation
- observations can have discrete or continuous values; for continuous values, probability density functions p(h|x) are used instead of probabilities
- error-optimal decision: choose the hypothesis that maximizes the a posteriori probability (MAP decision)

Bayesian Reasoning
the a posteriori probability is difficult to estimate directly; Bayes' rule provides the missing link:
$P(h, \vec{x}) = P(\vec{x}, h) = P(h)\,P(\vec{x} \mid h) = P(\vec{x})\,P(h \mid \vec{x})$
$P(h \mid \vec{x}) = \frac{P(h)\,P(\vec{x} \mid h)}{P(\vec{x})}$
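A minimal Python sketch of this computation for a small discrete hypothesis space; the hypothesis names, priors, and likelihood values are made-up illustration data, not taken from the slides:

```python
# Bayes' rule on a discrete hypothesis space (illustrative numbers only).

def posterior(prior, likelihood, x):
    """P(h | x) = P(h) * P(x | h) / P(x) for every hypothesis h."""
    evidence = sum(prior[h] * likelihood[h][x] for h in prior)        # P(x)
    return {h: prior[h] * likelihood[h][x] / evidence for h in prior}

prior = {"h1": 0.5, "h2": 0.5}                                        # P(h)
likelihood = {"h1": {"x": 0.2}, "h2": {"x": 0.6}}                     # P(x | h)
print(posterior(prior, likelihood, "x"))                              # {'h1': 0.25, 'h2': 0.75}
```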

Bayesian Reasoning
classification: using the posterior probability as the target function
$h_{MAP} = \arg\max_{h_i \in H} \frac{P(h_i)\,P(\vec{x} \mid h_i)}{P(\vec{x})} = \arg\max_{h_i \in H} P(h_i)\,P(\vec{x} \mid h_i)$
simplified form: maximum likelihood (ML) decision, e.g. if the priors are uniform
$h_{ML} = \arg\max_{h_i \in H} P(\vec{x} \mid h_i)$

Bayesian Reasoning
advantages:
- allows domain knowledge (prior probabilities) to be included
- can deal with inconsistent training data
- provides probabilistic results (confidence values)
but: the probability distributions have to be estimated, usually with many parameters

Bayesian Reasoning
derived results: Bayesian analysis of learning paradigms may uncover their hidden assumptions, even if they are not probabilistic:
- every consistent learner outputs a MAP hypothesis under the assumption of uniform prior probabilities for all hypotheses and deterministic, noise-free training data
- if the real training data can be assumed to be produced from ideal data by adding a normally distributed noise term, any learner that minimizes the mean squared error yields an ML hypothesis
- if an observed Boolean value is a probabilistic function of the input value, minimizing the cross entropy in neural networks yields an ML hypothesis

Bayesian Reasoning
derived results (cont.): if optimal encodings are chosen for the hypotheses and for the training data given the hypothesis, selecting the hypothesis according to the principle of minimal description length gives a MAP hypothesis:
$h_{MDL} = \arg\min_{h \in H} \left[ L_{C_1}(h) + L_{C_2}(D \mid h) \right]$

Bayesian Learning: Bayesian Reasoning, Bayes Optimal Classifier, Naïve Bayes Classifier, Modelling with Probability Density Functions, Parameter Estimation, Bayesian Networks, Markov Models, Dynamic Bayesian Networks, Conditional Random Fields

Bayes Optimal Classifier
the Bayes classifier does not always produce a true MAP decision, e.g. for composite results:

hypothesis:             h1    h2    h3
posterior probability:  0.3   0.4   0.3

the maximum of the posteriors gives h2, but if a new observation is classified positive by h2 and negative by h1 and h3, the MAP decision over outcomes would be negative
extension of the Bayes classifier to composite decisions: separating hypotheses h from decisions v
$v_{MAP} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid \vec{x})$
simplification for $P(v \mid h) \in \{0, 1\}$
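The following sketch reproduces the example above: the MAP hypothesis is h2, but the Bayes optimal decision over the composite outcome is negative. The hypothesis names and the 0/1 mapping from hypotheses to predictions mirror the slide's example:

```python
# Bayes optimal classification for the 0.3/0.4/0.3 example above;
# P(v | h) is deterministic: h2 predicts "pos", h1 and h3 predict "neg".

posterior = {"h1": 0.3, "h2": 0.4, "h3": 0.3}          # P(h_i | x)
prediction = {"h1": "neg", "h2": "pos", "h3": "neg"}   # deterministic P(v | h)

def bayes_optimal(posterior, prediction, decisions=("pos", "neg")):
    # v_MAP = argmax_v sum_h P(v | h) * P(h | x)
    score = {v: sum(p for h, p in posterior.items() if prediction[h] == v)
             for v in decisions}
    return max(score, key=score.get), score

print(bayes_optimal(posterior, prediction))            # ('neg', {'pos': 0.4, 'neg': 0.6})
```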

Bayesian Learning: Bayesian Reasoning, Bayes Optimal Classifier, Naïve Bayes Classifier, Cost-Sensitive Decisions, Modelling with Probability Density Functions, Parameter Estimation, Bayesian Networks, Markov Models, Dynamic Bayesian Networks, Conditional Random Fields

Naïve Bayes Classifier
the Bayes optimal classifier is too expensive:
$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid \vec{x}) = \arg\max_{v_j \in V} P(v_j \mid x_1, x_2, \ldots, x_n) = \arg\max_{v_j \in V} P(v_j)\,P(x_1, x_2, \ldots, x_n \mid v_j)$
prohibitively many parameters to estimate
independence assumption: $P(x_i \mid v_j)$ is independent of $P(x_k \mid v_j)$ for $i \neq k$
$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(x_i \mid v_j)$
simple training, usually good results
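Below is a minimal sketch of a naïve Bayes classifier for categorical features, assuming add-one smoothing over binary feature values; the toy weather data is purely illustrative:

```python
# Naive Bayes: v_NB = argmax_v P(v) * prod_i P(x_i | v), estimated by counting.
from collections import Counter, defaultdict

def train_nb(examples):                      # examples: list of (feature_dict, label)
    priors = Counter(label for _, label in examples)
    cond = defaultdict(Counter)              # cond[(feature, label)][value] = count
    for feats, label in examples:
        for f, v in feats.items():
            cond[(f, label)][v] += 1
    return priors, cond, len(examples)

def classify_nb(feats, priors, cond, n):
    def score(label):                        # add-one smoothing, assuming binary feature values
        s = priors[label] / n
        for f, v in feats.items():
            s *= (cond[(f, label)][v] + 1) / (priors[label] + 2)
        return s
    return max(priors, key=score)

data = [({"outlook": "sunny"}, "no"), ({"outlook": "rain"}, "yes"),
        ({"outlook": "rain"}, "yes"), ({"outlook": "sunny"}, "yes")]
priors, cond, n = train_nb(data)
print(classify_nb({"outlook": "rain"}, priors, cond, n))   # 'yes'
```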

Bayesian Learning: Bayesian Reasoning, Bayes Optimal Classifier, Naïve Bayes Classifier, Cost-Sensitive Decisions, Modelling with Probability Density Functions, Parameter Estimation, Bayesian Networks, Markov Models, Dynamic Bayesian Networks, Conditional Random Fields

Cost-Sensitive Decisions
error-optimal classification is not always welcome: for highly asymmetric distributions (diseases, errors, failures, ...) the priors determine the decision
remedy: include a cost function in the decision rule
$c_{ij}$: cost of predicting class i when the true class is j
cost matrix: $C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix}$

Cost-Sensitive Decisions
the Bayes classifier with a cost function can help to reduce false positives/negatives:
$h(\vec{x}) = \arg\min_{h_i \in H} \sum_j c_{ij}\, p(h_j \mid \vec{x})$
alternative: biased sampling of the training data (not really effective)
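A small sketch of this decision rule, choosing the class with minimal expected cost; the posteriors are made up, and the cost values anticipate the disease cost matrix shown on the next slide:

```python
# Cost-sensitive decision: h(x) = argmin_i sum_j c_ij * p(h_j | x).

posterior = {"not_ill": 0.9, "ill": 0.1}                             # p(h_j | x), illustrative
cost = {("predict_not_ill", "not_ill"): 0, ("predict_not_ill", "ill"): 1,
        ("predict_ill", "not_ill"): 9, ("predict_ill", "ill"): 0}    # c_ij

def min_expected_cost(posterior, cost, decisions):
    expected = {d: sum(cost[(d, h)] * p for h, p in posterior.items()) for d in decisions}
    return min(expected, key=expected.get), expected

print(min_expected_cost(posterior, cost, ["predict_not_ill", "predict_ill"]))
# ('predict_not_ill', {'predict_not_ill': 0.1, 'predict_ill': 8.1})
```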

Cost-Sensitive Decisions
not every cost matrix is a reasonable one; reasonableness conditions:
- correct decisions should be less expensive than incorrect ones: $c_{ii} < c_{ij}$ for $i \neq j$
- no row in the cost matrix should dominate another one; row m dominates row n if $c_{mj} \geq c_{nj}$ for all j, in which case the optimal policy always decides for the dominated class n
e.g. an asymmetric cost function for diseases:

                   actually not ill   actually ill
predict not ill           0                1
predict ill               9                0

Cost-Sensitive Decisions
any two-class cost matrix can be changed by adding a constant to every entry (shifting) or multiplying every entry by a constant (scaling) without affecting the optimal decision:
$\begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix} \;\to\; \begin{pmatrix} 0 & \frac{c_{01}-c_{00}}{c_{10}-c_{00}} \\ 1 & \frac{c_{11}-c_{00}}{c_{10}-c_{00}} \end{pmatrix}$
so for the optimal decision effectively only one degree of freedom remains (cf. the threshold $p^*$ on the next slide)!

Cost-Sensitive Decisions
an optimal decision requires the expected cost of the decision to be no larger than the expected cost of the alternative decisions; in the two-class case, predict class 1 iff
$(1 - p)\,c_{10} + p\,c_{11} \leq (1 - p)\,c_{00} + p\,c_{01}$ with $p = P(\text{class} = 1 \mid \vec{x})$
threshold for making optimal cost-sensitive decisions:
$(1 - p^*)\,c_{10} + p^*\,c_{11} = (1 - p^*)\,c_{00} + p^*\,c_{01} \;\Rightarrow\; p^* = \frac{c_{10} - c_{00}}{c_{10} - c_{00} + c_{01} - c_{11}}$
can be used e.g. in decision tree learning
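A short sketch of the threshold rule, using the disease cost matrix from the earlier slide (c00 = c11 = 0, c01 = 1, c10 = 9, with class 1 = "ill"):

```python
# Threshold p* for optimal cost-sensitive two-class decisions.

def threshold(c00, c01, c10, c11):
    """Predicting class 1 is optimal whenever p = P(class=1 | x) >= p*."""
    return (c10 - c00) / ((c10 - c00) + (c01 - c11))

def decide(p, c00, c01, c10, c11):
    return 1 if p >= threshold(c00, c01, c10, c11) else 0

print(threshold(0, 1, 9, 0))      # 0.9 -> predict "ill" only if P(ill | x) >= 0.9
print(decide(0.8, 0, 1, 9, 0))    # 0
```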

Cost-Sensitive Decisions
costs are a dangerous perspective for many applications:
- e.g. rejecting a good bank loan application is a missed opportunity, not an actual loss
- costs are easily measured against different baselines; benefits provide a more natural (uniform) baseline: cash flow
- costs/benefits are usually not constant for every instance, e.g. the potential benefit/loss of a defaulted bank loan varies with the amount

Bayesian Learning: Bayesian Reasoning, Bayes Optimal Classifier, Naïve Bayes Classifier, Cost-Sensitive Decisions, Modelling with Probability Density Functions, Parameter Estimation, Bayesian Networks, Markov Models, Conditional Random Fields

Modelling with Probability Density Functions
probability density functions $p(\vec{x} \mid v_j)$ instead of $P(\vec{x} \mid v_j)$, since $P(\vec{x} \mid v_j)$ is always zero in a continuous domain
choosing a distribution class, e.g. Gaussian or Laplace:
$p(x \mid v) = N[x, \mu, \sigma] = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
$p(x \mid v) = L[x, \mu, \sigma] = \frac{1}{2\sigma}\, e^{-\frac{|x-\mu|}{\sigma}}$
parameters: mean $\mu$ and standard deviation (scale) $\sigma$
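A quick sketch evaluating the two density classes named above; the parameter values are arbitrary:

```python
# Gaussian and Laplace densities as defined on the slide.
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def laplace_pdf(x, mu, sigma):
    return math.exp(-abs(x - mu) / sigma) / (2 * sigma)

print(gaussian_pdf(0.0, 0.0, 1.0))   # ~0.3989
print(laplace_pdf(0.0, 0.0, 1.0))    # 0.5
```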

Modelling with Probability Density Functions
distributions for multidimensional observations, e.g. the multivariate normal distribution
$p(\vec{x} \mid v) = N[\vec{x}, \vec{\mu}, \Sigma]$
parameters: vector of means $\vec{\mu}$, covariance matrix $\Sigma$

Modelling with Probability Density Functions
- diagonal covariance matrix, uniformly filled (rotation symmetry around the mean):
$\sigma_{ij} = \begin{cases} n & \text{for } i = j \\ 0 & \text{else} \end{cases}$
- diagonal covariance matrix filled with arbitrary values (reflection symmetry):
$\sigma_{ij} = \begin{cases} n_i & \text{for } i = j \\ 0 & \text{else} \end{cases}$
- completely filled covariance matrix

Modelling with Probability Density Functions
- diagonal covariance matrix: uncorrelated features, relatively small number of parameters to be trained (naïve Bayes classifier)
- completely filled covariance matrix: correlated features, high number of parameters to be trained
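To make the parameter-count difference concrete, here is a small sketch that counts the per-class parameters of a d-dimensional Gaussian model (d means plus either d variances or d(d+1)/2 covariance entries); the chosen dimensions are arbitrary:

```python
# Per-class parameter count for a d-dimensional Gaussian model.
def n_params(d, full_covariance):
    return d + (d * (d + 1) // 2 if full_covariance else d)

for d in (2, 10, 100):
    print(d, n_params(d, False), n_params(d, True))
# 2 4 5
# 10 20 65
# 100 200 5150
```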

Modelling with Probability Density Functions
decorrelation of the features by a transformation of the feature space: Principal Component Analysis, Karhunen-Loève Transformation

Modelling with Probability Density Functions
compromise: mixture densities, a superposition of several Gaussians with uncorrelated features
$p(\vec{x} \mid v) = \sum_{m=1}^{M} c_m\, N[\vec{x}, \vec{\mu}_m, \Sigma_m]$
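A one-dimensional sketch of such a mixture density; the mixture weights and component parameters are illustrative assumptions:

```python
# Mixture density p(x | v) = sum_m c_m * N[x, mu_m, sigma_m] (1-D case).
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mixture_pdf(x, components):
    # components: list of (weight c_m, mean mu_m, std sigma_m); the weights should sum to 1
    return sum(c * gaussian_pdf(x, mu, sigma) for c, mu, sigma in components)

gmm = [(0.7, 0.0, 1.0), (0.3, 4.0, 0.5)]
print(mixture_pdf(0.0, gmm))   # dominated by the first component
```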

Modelling with Probability Density Functions
mixture density functions introduce a hidden variable: which Gaussian produced the value?
two-step stochastic process:
- choose a mixture component randomly:
$z_{ij} = \begin{cases} 1 & \text{if } \vec{x}_i \text{ was generated by } p_j(\vec{x} \mid v) \\ 0 & \text{otherwise} \end{cases}$
- choose a value randomly from the selected component density
[figure: selection probabilities $P(z_{i1} = 1), \ldots, P(z_{in} = 1)$ over the component densities $p_1(\vec{x} \mid v), \ldots, p_n(\vec{x} \mid v)$]
direct estimation of the distribution parameters is not possible; instead: iterative refinement (Expectation Maximization)

Bayesian Learning: Bayesian Reasoning, Bayes Optimal Classifier, Naïve Bayes Classifier, Cost-Sensitive Decisions, Modelling with Probability Density Functions, Parameter Estimation, Bayesian Networks, Markov Models, Dynamic Bayesian Networks, Conditional Random Fields

Parameter Estimation
- complete data: maximum likelihood estimation, Bayesian estimation
- incomplete data: expectation maximization (or gradient descent techniques)

Maximum Likelihood Estimation
likelihood of the model M given the (training) data D:
$L(M \mid D) = \prod_{d \in D} P(d \mid M)$
log-likelihood:
$LL(M \mid D) = \sum_{d \in D} \log_2 P(d \mid M)$
choose among several possible models for describing the data according to the principle of maximum likelihood; the models only differ in their set of parameters $\Theta$:
$\hat{\Theta} = \arg\max_{\Theta} L(M_\Theta \mid D) = \arg\max_{\Theta} LL(M_\Theta \mid D)$
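A minimal sketch: the log-likelihood of a one-parameter Bernoulli model on a made-up data set, with the ML estimate found by a simple grid search:

```python
# LL(M_theta | D) = sum_d log2 P(d | M_theta) for a Bernoulli model.
import math

def log_likelihood(theta, data):
    return sum(math.log2(theta if d == 1 else 1 - theta) for d in data)

data = [1, 0, 1, 1, 0, 1]
candidates = [i / 100 for i in range(1, 100)]
theta_hat = max(candidates, key=lambda t: log_likelihood(t, data))
print(theta_hat)   # ~0.67, close to the counting estimate 4/6
```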

Maximum Likelihood Estimation
complete data: estimating the parameters by counting
$P(A = a) = \frac{N(A = a)}{\sum_{v \in \mathrm{dom}(A)} N(A = v)}$
$P(A = a \mid B = b, C = c) = \frac{N(A = a, B = b, C = c)}{N(B = b, C = c)}$
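A counting sketch on a tiny made-up table of complete data (attribute names and values are hypothetical):

```python
# ML estimation by counting relative frequencies.
records = [{"A": "a1", "B": "b1"}, {"A": "a1", "B": "b2"},
           {"A": "a2", "B": "b1"}, {"A": "a1", "B": "b1"}]

def p(records, attr, value):                       # P(A = a)
    return sum(r[attr] == value for r in records) / len(records)

def p_cond(records, attr, value, **given):         # P(A = a | B = b, ...)
    match = [r for r in records if all(r[k] == v for k, v in given.items())]
    return sum(r[attr] == value for r in match) / len(match)

print(p(records, "A", "a1"))                 # 0.75
print(p_cond(records, "A", "a1", B="b1"))    # 0.666...
```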

Bayesian Estimation
sparse databases result in pessimistic estimates for unseen events: if the count for an event in the database is 0, the event is considered impossible by the model
Bayesian estimation: use an estimate of the prior probability as the starting point for the counting (estimation of maximum a posteriori parameters); no zero counts can occur
if nothing else is available, use a uniform distribution as the prior
Bayesian estimate in the binary case with a uniform prior:
$P(\text{yes}) = \frac{n + 1}{n + m + 2}$
n: counts for yes, m: counts for no; effectively, virtual counts are added to the estimate
alternative: smoothing as a post-processing step
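A tiny sketch contrasting the plain counting estimate with the Bayesian estimate with virtual counts (binary case, uniform prior); the counts are made up:

```python
# (n + 1) / (n + m + 2) versus the plain ML estimate n / (n + m).
def ml_estimate(n_yes, n_no):
    return n_yes / (n_yes + n_no)

def bayes_estimate(n_yes, n_no):
    return (n_yes + 1) / (n_yes + n_no + 2)

print(ml_estimate(0, 5), bayes_estimate(0, 5))   # 0.0 vs ~0.143: unseen is not impossible
```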

Incomplete Data
- missing at random (MAR): the probability that a value is missing depends only on the observed values, e.g. a confirmation measurement whose values are available only if the preceding measurement was positive/negative
- missing completely at random (MCAR): the probability that a value is missing is also independent of the values, e.g. stochastic failures of the measurement equipment, or hidden/latent variables (mixture coefficients of a Gaussian mixture distribution)
- nonignorable: neither MAR nor MCAR, the probability depends on the unseen values, e.g. exit polls for extremist parties

Expectation Maximization
estimating the means of a Gaussian mixture distribution:
1. choose an initial hypothesis $\Theta = (\mu_1, \ldots, \mu_k)$
2. estimate the expected values $E(z_{ij})$ of the hidden variables given $\Theta = (\mu_1, \ldots, \mu_k)$, where
$z_{ij} = \begin{cases} 1 & \text{if } \vec{x}_i \text{ was generated by } p_j(\vec{x} \mid v) \\ 0 & \text{otherwise} \end{cases}$
3. recalculate the maximum likelihood estimate of the means $\Theta' = (\mu'_1, \ldots, \mu'_k)$ using the expected $z_{ij}$
4. replace $\mu_j$ by $\mu'_j$ and repeat until convergence
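A sketch of this procedure for a one-dimensional mixture, assuming equal mixture weights and a fixed, shared variance so that only the means are re-estimated; the synthetic data set is illustrative:

```python
# EM for the means of a 1-D Gaussian mixture with k components.
import math, random

def em_means(xs, k, sigma=1.0, iters=50):
    mus = random.sample(xs, k)                      # initial hypothesis Theta = (mu_1, ..., mu_k)
    for _ in range(iters):
        # E-step: E(z_ij), the responsibility of component j for sample i
        resp = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: ML re-estimate of each mean, weighted by E(z_ij)
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
               for j in range(k)]
    return mus

random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
print(sorted(em_means(data, 2)))   # roughly [0, 5]
```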

Expectation Maximization
- expectation step: complete the data set, using the current estimate $h = \Theta$ to calculate expectations for the missing values; this applies the model to be learned (Bayesian inference)
- maximization step: use the completed data set to find a new maximum likelihood estimate $h' = \Theta'$

Expectation Maximization
generalizing the EM framework: estimating the underlying distribution of variables that are not directly observable
- full data: tuples $\langle \vec{x}_i, z_{i1}, \ldots, z_{ik} \rangle$, of which only $\vec{x}_i$ can be observed
- training data: $X = \{\vec{x}_1, \ldots, \vec{x}_m\}$; hidden information: $Z = \{z_1, \ldots, z_m\}$
- parameters of the distribution to be estimated: $\Theta$
- $Z$ can be treated as a random variable with $p(Z) = f(\Theta, X)$; full data: $Y = X \cup Z$
- hypothesis $h$ of $\Theta$, which needs to be revised into $h'$

Expectation Maximization
goal of EM: $h' = \arg\max_{h'} E(\log_2 p(Y \mid h'))$
define a function $Q(h' \mid h) = E(\log_2 p(Y \mid h') \mid h, X)$
Estimation (E) step: calculate $Q(h' \mid h)$ using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
$Q(h' \mid h) \leftarrow E(\log_2 p(Y \mid h') \mid h, X)$
Maximization (M) step: replace hypothesis h by the h' that maximizes the function Q:
$h \leftarrow \arg\max_{h'} Q(h' \mid h)$

Expectation Maximization
- the expectation step requires applying the model to be learned (Bayesian inference)
- like gradient ascent search, EM converges to the nearest local optimum; the global optimum is not guaranteed

Expectation Maximization
E step: $Q(h' \mid h) \leftarrow E(\ln p(Y \mid h') \mid h, X)$
M step: $h \leftarrow \arg\max_{h'} Q(h' \mid h)$
if Q is continuous, EM converges to a local maximum of the likelihood function $P(Y \mid h')$