X={x ij } φ ik =φ k (x i ) Φ={φ ik } Lecture 11: Information Theoretic Methods. Mutual Information as Information Gain. Feature Transforms

Size: px
Start display at page:

Download "X={x ij } φ ik =φ k (x i ) Φ={φ ik } Lecture 11: Information Theoretic Methods. Mutual Information as Information Gain. Feature Transforms"

Transcription

1 Lecture 11: Information Theoretic Methods Isabelle Guyon Mutual Information as Information Gain Book Chapter 6 and Feature Transforms Information, Entropy m x i n ={x ij } φ ik =φ k (x i ) For an event happening with probability p, the quantity of information is log p. If the log is of base 2, the unit is a bit. n (n <n) The average quantity of information over all events is the entropy: m Φ={φ ik } m H = - Σ p log p 1

2 Conditional Entropy Mutual Information Our case: =class labels, Φ feature representation. Entropy: H() = - Σ y p(y) log p(y) Conditional entropy: Information Gain: The amount of class uncertainty reduction by observing Φ is: MI(, Φ) = H() H( Φ) H( Φ)= - f p(f) Σ y p(y f) log p(y f) df MI(, Φ) = Σ y f p(y, f) log p(y, f) p(y) p(f) df Mutual Information MI Estimation H( ) H(, ) MI(,) H( ) P() P()*P() P(, ) pxy=[4/16, 6/16; 5/16, 1/16] px=sum(pxy) py=sum(pxy, 2) pxpy=(py*px) MI=sum(sum(pxy.*log2(pxy./pxpy))) pxy = px = H() H() H(,) = H() + H() MI(,) MI(,) = H()-H( ) = H()-H( ) P() Possible normalization by (H()+H())/2 py = pxpy = MI =

3 Tree Classifiers CART (Breiman, 1984) or C4.5 (Quinlan, 1993) Information Gain H before = -11/19 log(11/19) -8/19 log(8/19) = 0.98 f 2 All the data f 2 All the data Choose f 1 Choose f 2 f 1 At each step, choose the feature that reduces entropy most. Work towards node purity. Choose f 2 11/19 8/19 H left = -4/11 log(4/11) 7/11 log(7/11) = 0.94 f 1 H right = -7/8 log(7/8) 1/8 log(1/8) = 0.54 IG = H before (11/19 H left + 8/19 H right ) = = 0.2 IG = MI Iris Data (Fisher, 1936/Chapter 1) Linear discriminant Tree classifier H before = H() H left = H( f 2 =-1) H right = H( f 2 =1) IG = H before (p(f 2 =-1) H left + p(f 2 =1) H right ) = H() (p(f 2 =-1) H( f 2 =-1) + p(f 2 =1) H( f 2 =1)) = H() H( f 2 ) = MI(,f 2 ) setosa virginica versicolor Gaussian mixture Kernel method (SVM) 3

4 MI for Feature Selection Forward Selection with MI Random Forest (Breiman, Cutler, 2002) Feature ranking Forward selection Feature subset selection (Markov Blankets, scaling factors, shrinkage) Fleuret, Practical only for binary features. Select a first feature?(1) with maximum MI with the target. For each remaining feature i and each previously selected feature?(j), compute the conditional mutual information: CMI( i,?(j) )= Σ?(j) P(?(j) ) MI( i,?(j) ) Select the feature with maximum CMI. Transmission Channels Transmission Channels Noisy channel Channel capacity = max p(x) MI(, ) = (second Shannon theorem) highest achievable information transmission rate without error. 4

5 Coding MI and Error Rate Encoder Noisy channel Fano, 1961 Code efficiency = length(code) / length(original signal) Theorems (Shannon, 1948): Errorless transmission is achievable with sufficiently inefficient (well chosen) codes; the efficiency is bounded by the capacity. For a chosen error rate, one can find a shorter code. Hellman and Raviv (1970) Feder and Merhav (1994) Noisy Channel in PR Noisy Channel in PR This is a four 4 Category Encoder Ideal pattern Noisy channel Real pattern This is a four 4 Encoder Category Ideal pattern Noisy channel Real pattern Decoder Predicted category Preprocessor Features Φ Decoder Predicted category 5

6 Information Bottleneck Tishby et al., 1999 Φ Information Bottleneck Methods maximize MI(, Φ) = MI(Φ,) but minimize MI(, Φ) Gradient Descent Computational Tricks (1/4) Class labels: y High dim. data: x f(q, x) Low dim. Feature vector: f Mutual Information MI(f, y) Feature x construction/ f Posterior p(y f) probability (dim n) selection (dim n ) estimation (dim 1) f(q, x) 1 (1, n θ ) (1, n ) (n, n θ ) MI/ q = MI/ f p(y f) appears here f/ q Gradient: MI/ q Torkkola, 2001 (1, n ) (n, n) (n, 1) For example f(w, x) = W x T (linear model) f/ w=x 6

7 Computational Tricks (2/4) Computational Tricks (3/4) MI(, Φ) = Σ y φ p(y, f) log p(y, f) p(y) p(f) df 2 MI(, Φ) = Σ y φ (p(y, f) - p(y) p(f)) 2 df Justification: Renyi entropy H() = - log Σ y p(y) 2 3 Use densities that are mixtures of Gaussians. Examples of Suitable Densities Parzen Windows (Rosenblatt 1956, Parzen 1962) Gaussian mixture: p(f y) = G(f- f y, Σ y ) p(f) = Σ y p(y) p(f y) =Σ y (m y /m) G(f- f y, Σ y ) p(x) = (1/m)Σ i=1:m G(f- f i, σ m ) Parzen windows: one per class σ m = σ/sqrt(m) p(f y) = (1/m y ) Σ c y G(f- f c, σ) p(f y) = (1/m) Σ i G(f- f i, σ) the same for all one per example From Duda and Hart,

8 Computational Tricks (4/4) Plugging In MI(, Φ) = Σ y φ (p(y, f) - p(y) p(f)) 2 df 4 MI(, Φ) = V in + V all - V bw V in = function ( G(f ci - f cj, 2σ) ) MI(, Φ) = Σ y φ p(y, f) 2 df V in V all = function ( G(f i - f j, 2σ) ) + Σ y φ p(y) 2 p(f) 2 df V all V bw = function ( G(f ci - f j, 2σ) ) - 2 Σ y φ p(y, f) p(y) p(f) df V bw Information Forces Max MI by hill climbing (gradient descent): q(t+1) = q(t) + η MI/ q = q(t) + η MI/ f f/ q = q(t) + η [ V in / f + V all / f - V bw / f] f/ q Force = - gradient (Energy) Energy - MI V in / f Cohesion within classes V all / f Overall cohesion - V bw / f Repulsion to other classes Feature Construction Torkkola, 2001 Example 1:Three classes in 3-dimensional space. Artificial data. 1.a f = linear, p(y f) = Parzen windows 1.b f = RBF, p(y f) = Parzen windows Example 2:Three classes in 12-dimensional space. This is the oil-pipeflowdata from NCRG at Aston University. 2.a f = linear, p(y f) = Parzen windows 2.b f = linear, p(y f) = Gaussian mixture Example 3: Six classes in 36-dimensional space. This is the Landsat satellite image database from Statlogproject. f = linear, p(y f) = Parzen windows 8

9 Application to Feature Selection Φ i = σ i x i scaling factors, σ i 0 s * = argmax s MI(Φ, ) subject to: Σ i σ 2 i = 1 (shear) s * = argmin s MI(, Φ) - β MI(Φ, ) Conclusion MI characterizes the dependency between random variables: Feature ranking Forward selection with CMI MI is also an Information Gain Tree classifiers Random Forests MI characterizes the channel capacity and the rate of distorsion Information bottleneck methods min q MI(, Φ) - β MI(Φ, ) Madelon (continued) Exercise Class Madelon How to review a paper? Exercises on IT methods my_classif=svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'}); my_model=chain({probe(relief,{'p_num=2000', 'pval_max=0'}), standardize, my_classif}) f_max=20' Earn 1 point with a result better than the baseline method (testber<7.33 or [testber=7.33 and feat_num<20 (4%)]). Earn a total of 2 points with a result testber<6.79% or [testber=6.79% and feat_num<17 (3.4%)]. 9

10 Homework Review form Write a review of a paper in the Feature Extraction book (the paper you presented or will present or any paper/chapter you want). Questions: What is the goal of a review? What is the goal of publishing? A. Eliminating criteria (a zero means rejection) 1. Scope 2. Novelty 3. Usefulness 4. Sanity B. Other fundamental criteria (a low grade implies major revisions) 1. Quantity 2. Reproducibility 3. Demonstration 4. Comparison Review Form (continued) Review Form (end) C. Forgivable problems of contents (minor revisions may be required) 1. Completeness 2. Take-aways 3. Bibliography 4. Outlook D. Bonus reproducibility criteria (give additional credit) 1. Data availability 2. Code availability E. Message delivery (major/minor revisions may be required) 1. Readability 2. Notations 3. Figures 4. Formalism 5. Density 6. Language 10

11 MI and Pearson Correlation Exercise: prove MI=-1/2 log(1-r 2 ) P() MI(, ) = -(1/2) log(1-r 2 ) P() Prove: Entropy Gaussian source: H() = ½ log(2πe)σ 2 Solution of one-variable least-square regression y=ax: a=σ xy /σ x 2. We will call a the sol. of x=a y. Pearson correlation coefficient: R 2 =aa Residual error σ 2 =E(y-ax) 2 =(1-R 2 )σ y 2. =function()+n, N is random noise MI(,)=H()-H(N) MI(,) H()-H(N) For a Gaussian channel with Gaussian source: MI(,)=-½ log σ 2 /σ y 2. Conclude the proof. Markov Blankets Exercise (Markov Blankets) Notations: All features, Features selected Φ, complement Φ, =[Φ, Φ ]. Definition: A Markov Blanket is a subset of features Φ such that: P(, Φ) = P( Φ) P( Φ) We write - Φ ( is independent of given Φ) If Φ is a MB, prove P( Φ) = P( ) for all assigments of values to Φ. We define DMI(Φ) = E p() D KL (P( ) P( Φ)) Prove: DMI(Φ)=E p(,) P( )/P( Φ) We define A=log[P(,)/(P()P())] and B=log [P( Φ,)/(P( Φ)P())], prove: A-B=P( )/P( Φ) Σ {Φ} P( Φ,)B= Σ {} P(,)B DMI(Φ) = MI(,) MI(Φ,) Define an optimization problem to find a minimum MB. How can you relate it to the information bottleneck approach? 11

12 Complement of exercise What are the similarities and differences between the bottleneck IT methods and the Bayesian/regularized risk methods? Can you suggest hybrid approaches? Can you suggest forward selection and backward elimination IT methods using the scaling factor approach? 12

Variable selection and feature construction using methods related to information theory

Variable selection and feature construction using methods related to information theory Outline Variable selection and feature construction using methods related to information theory Kari 1 1 Intelligent Systems Lab, Motorola, Tempe, AZ IJCNN 2007 Outline Outline 1 Information Theory and

More information

References. Lecture 7: Support Vector Machines. Optimum Margin Perceptron. Perceptron Learning Rule

References. Lecture 7: Support Vector Machines. Optimum Margin Perceptron. Perceptron Learning Rule References Lecture 7: Support Vector Machines Isabelle Guyon guyoni@inf.ethz.ch An training algorithm for optimal margin classifiers Boser-Guyon-Vapnik, COLT, 992 http://www.clopinet.com/isabelle/p apers/colt92.ps.z

More information

CS 630 Basic Probability and Information Theory. Tim Campbell

CS 630 Basic Probability and Information Theory. Tim Campbell CS 630 Basic Probability and Information Theory Tim Campbell 21 January 2003 Probability Theory Probability Theory is the study of how best to predict outcomes of events. An experiment (or trial or event)

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

References. Lecture 3: Shrinkage. The Power of Amnesia. Ockham s Razor

References. Lecture 3: Shrinkage. The Power of Amnesia. Ockham s Razor References Lecture 3: Shrinkage Isabelle Guon guoni@inf.ethz.ch Structural risk minimization for character recognition Isabelle Guon et al. http://clopinet.com/isabelle/papers/sr m.ps.z Kernel Ridge Regression

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Bayesian Support Vector Machines for Feature Ranking and Selection

Bayesian Support Vector Machines for Feature Ranking and Selection Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction

More information

Information Theory and Feature Selection (Joint Informativeness and Tractability)

Information Theory and Feature Selection (Joint Informativeness and Tractability) Information Theory and Feature Selection (Joint Informativeness and Tractability) Leonidas Lefakis Zalando Research Labs 1 / 66 Dimensionality Reduction Feature Construction Construction X 1,..., X D f

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Information Theory Primer:

Information Theory Primer: Information Theory Primer: Entropy, KL Divergence, Mutual Information, Jensen s inequality Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Part I Week 7 Based in part on slides from textbook, slides of Susan Holmes

Part I Week 7 Based in part on slides from textbook, slides of Susan Holmes Part I Week 7 Based in part on slides from textbook, slides of Susan Holmes Support Vector Machine, Random Forests, Boosting December 2, 2012 1 / 1 2 / 1 Neural networks Artificial Neural networks: Networks

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016

ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma

More information

Information Theory, Statistics, and Decision Trees

Information Theory, Statistics, and Decision Trees Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Fall 2018 Some slides from Tom Mitchell, Dan Roth and others 1 Key issues in machine learning Modeling How to formulate your problem as a machine learning problem?

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

Lecture 4 Channel Coding

Lecture 4 Channel Coding Capacity and the Weak Converse Lecture 4 Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 15, 2014 1 / 16 I-Hsiang Wang NIT Lecture 4 Capacity

More information

Information Theory - Entropy. Figure 3

Information Theory - Entropy. Figure 3 Concept of Information Information Theory - Entropy Figure 3 A typical binary coded digital communication system is shown in Figure 3. What is involved in the transmission of information? - The system

More information

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon Lecture 2: Enseble Methods Isabelle Guyon guyoni@inf.ethz.ch Introduction Book Chapter 7 Weighted Majority Mixture of Experts/Coittee Assue K experts f, f 2, f K (base learners) x f (x) Each expert akes

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Limits on classical communication from quantum entropy power inequalities

Limits on classical communication from quantum entropy power inequalities Limits on classical communication from quantum entropy power inequalities Graeme Smith, IBM Research (joint work with Robert Koenig) QIP 2013 Beijing Channel Capacity X N Y p(y x) Capacity: bits per channel

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

Quiz 2 Date: Monday, November 21, 2016

Quiz 2 Date: Monday, November 21, 2016 10-704 Information Processing and Learning Fall 2016 Quiz 2 Date: Monday, November 21, 2016 Name: Andrew ID: Department: Guidelines: 1. PLEASE DO NOT TURN THIS PAGE UNTIL INSTRUCTED. 2. Write your name,

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

ELEMENT OF INFORMATION THEORY

ELEMENT OF INFORMATION THEORY History Table of Content ELEMENT OF INFORMATION THEORY O. Le Meur olemeur@irisa.fr Univ. of Rennes 1 http://www.irisa.fr/temics/staff/lemeur/ October 2010 1 History Table of Content VERSION: 2009-2010:

More information

Learning Decision Trees

Learning Decision Trees Learning Decision Trees Machine Learning Spring 2018 1 This lecture: Learning Decision Trees 1. Representation: What are decision trees? 2. Algorithm: Learning decision trees The ID3 algorithm: A greedy

More information

Lecture 14 February 28

Lecture 14 February 28 EE/Stats 376A: Information Theory Winter 07 Lecture 4 February 8 Lecturer: David Tse Scribe: Sagnik M, Vivek B 4 Outline Gaussian channel and capacity Information measures for continuous random variables

More information

Machine learning - HT Maximum Likelihood

Machine learning - HT Maximum Likelihood Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Shannon s noisy-channel theorem

Shannon s noisy-channel theorem Shannon s noisy-channel theorem Information theory Amon Elders Korteweg de Vries Institute for Mathematics University of Amsterdam. Tuesday, 26th of Januari Amon Elders (Korteweg de Vries Institute for

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

ELEC546 Review of Information Theory

ELEC546 Review of Information Theory ELEC546 Review of Information Theory Vincent Lau 1/1/004 1 Review of Information Theory Entropy: Measure of uncertainty of a random variable X. The entropy of X, H(X), is given by: If X is a discrete random

More information

Noisy channel communication

Noisy channel communication Information Theory http://www.inf.ed.ac.uk/teaching/courses/it/ Week 6 Communication channels and Information Some notes on the noisy channel setup: Iain Murray, 2012 School of Informatics, University

More information

Module 1. Introduction to Digital Communications and Information Theory. Version 2 ECE IIT, Kharagpur

Module 1. Introduction to Digital Communications and Information Theory. Version 2 ECE IIT, Kharagpur Module ntroduction to Digital Communications and nformation Theory Lesson 3 nformation Theoretic Approach to Digital Communications After reading this lesson, you will learn about Scope of nformation Theory

More information

Undirected Graphical Models: Markov Random Fields

Undirected Graphical Models: Markov Random Fields Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

Lecture 7. Union bound for reducing M-ary to binary hypothesis testing

Lecture 7. Union bound for reducing M-ary to binary hypothesis testing Lecture 7 Agenda for the lecture M-ary hypothesis testing and the MAP rule Union bound for reducing M-ary to binary hypothesis testing Introduction of the channel coding problem 7.1 M-ary hypothesis testing

More information

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University 2018 CS420, Machine Learning, Lecture 5 Tree Models Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html ML Task: Function Approximation Problem setting

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /

More information

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm 1. Feedback does not increase the capacity. Consider a channel with feedback. We assume that all the recieved outputs are sent back immediately

More information

NAME... Soc. Sec. #... Remote Location... (if on campus write campus) FINAL EXAM EE568 KUMAR. Sp ' 00

NAME... Soc. Sec. #... Remote Location... (if on campus write campus) FINAL EXAM EE568 KUMAR. Sp ' 00 NAME... Soc. Sec. #... Remote Location... (if on campus write campus) FINAL EXAM EE568 KUMAR Sp ' 00 May 3 OPEN BOOK exam (students are permitted to bring in textbooks, handwritten notes, lecture notes

More information

CS540 ANSWER SHEET

CS540 ANSWER SHEET CS540 ANSWER SHEET Name Email 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 1 2 Final Examination CS540-1: Introduction to Artificial Intelligence Fall 2016 20 questions, 5 points

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis Week 5 Based in part on slides from textbook, slides of Susan Holmes Part I Linear Discriminant Analysis October 29, 2012 1 / 1 2 / 1 Nearest centroid rule Suppose we break down our data matrix as by the

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

Chaos, Complexity, and Inference (36-462)

Chaos, Complexity, and Inference (36-462) Chaos, Complexity, and Inference (36-462) Lecture 7: Information Theory Cosma Shalizi 3 February 2009 Entropy and Information Measuring randomness and dependence in bits The connection to statistics Long-run

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy

Information Theory. Coding and Information Theory. Information Theory Textbooks. Entropy Coding and Information Theory Chris Williams, School of Informatics, University of Edinburgh Overview What is information theory? Entropy Coding Information Theory Shannon (1948): Information theory is

More information

Lecture 11: Polar codes construction

Lecture 11: Polar codes construction 15-859: Information Theory and Applications in TCS CMU: Spring 2013 Lecturer: Venkatesan Guruswami Lecture 11: Polar codes construction February 26, 2013 Scribe: Dan Stahlke 1 Polar codes: recap of last

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

DATA MINING LECTURE 9. Minimum Description Length Information Theory Co-Clustering

DATA MINING LECTURE 9. Minimum Description Length Information Theory Co-Clustering DATA MINING LECTURE 9 Minimum Description Length Information Theory Co-Clustering MINIMUM DESCRIPTION LENGTH Occam s razor Most data mining tasks can be described as creating a model for the data E.g.,

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Feature selection. c Victor Kitov August Summer school on Machine Learning in High Energy Physics in partnership with

Feature selection. c Victor Kitov August Summer school on Machine Learning in High Energy Physics in partnership with Feature selection c Victor Kitov v.v.kitov@yandex.ru Summer school on Machine Learning in High Energy Physics in partnership with August 2015 1/38 Feature selection Feature selection is a process of selecting

More information

Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes. November 9, Statistics 202: Data Mining

Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes. November 9, Statistics 202: Data Mining Linear Discriminant Analysis Based in part on slides from textbook, slides of Susan Holmes November 9, 2012 1 / 1 Nearest centroid rule Suppose we break down our data matrix as by the labels yielding (X

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing

+ + ( + ) = Linear recurrent networks. Simpler, much more amenable to analytic treatment E.g. by choosing Linear recurrent networks Simpler, much more amenable to analytic treatment E.g. by choosing + ( + ) = Firing rates can be negative Approximates dynamics around fixed point Approximation often reasonable

More information

Entropies & Information Theory

Entropies & Information Theory Entropies & Information Theory LECTURE I Nilanjana Datta University of Cambridge,U.K. See lecture notes on: http://www.qi.damtp.cam.ac.uk/node/223 Quantum Information Theory Born out of Classical Information

More information

Lecture Notes for Statistics 311/Electrical Engineering 377. John Duchi

Lecture Notes for Statistics 311/Electrical Engineering 377. John Duchi Lecture Notes for Statistics 311/Electrical Engineering 377 March 13, 019 Contents 1 Introduction and setting 6 1.1 Information theory..................................... 6 1. Moving to statistics....................................

More information

MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006)

MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006) MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK SATELLITE COMMUNICATION DEPT./SEM.:ECE/VIII UNIT V PART-A 1. What is binary symmetric channel (AUC DEC 2006) 2. Define information rate? (AUC DEC 2007)

More information

Channel capacity. Outline : 1. Source entropy 2. Discrete memoryless channel 3. Mutual information 4. Channel capacity 5.

Channel capacity. Outline : 1. Source entropy 2. Discrete memoryless channel 3. Mutual information 4. Channel capacity 5. Channel capacity Outline : 1. Source entropy 2. Discrete memoryless channel 3. Mutual information 4. Channel capacity 5. Exercices Exercise session 11 : Channel capacity 1 1. Source entropy Given X a memoryless

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Lecture 2: August 31

Lecture 2: August 31 0-704: Information Processing and Learning Fall 206 Lecturer: Aarti Singh Lecture 2: August 3 Note: These notes are based on scribed notes from Spring5 offering of this course. LaTeX template courtesy

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines Kernel Methods & Support Vector Machines Mahdi pakdaman Naeini PhD Candidate, University of Tehran Senior Researcher, TOSAN Intelligent Data Miners Outline Motivation Introduction to pattern recognition

More information

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

9 Forward-backward algorithm, sum-product on factor graphs

9 Forward-backward algorithm, sum-product on factor graphs Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 9 Forward-backward algorithm, sum-product on factor graphs The previous

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

Information & Correlation

Information & Correlation Information & Correlation Jilles Vreeken 11 June 2014 (TADA) Questions of the day What is information? How can we measure correlation? and what do talking drums have to do with this? Bits and Pieces What

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Lecture 5: Classification

Lecture 5: Classification Lecture 5: Classification Advanced Applied Multivariate Analysis STAT 2221, Spring 2015 Sungkyu Jung Department of Statistics, University of Pittsburgh Xingye Qiao Department of Mathematical Sciences Binghamton

More information

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI

Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single

More information

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page.

EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 2018 Please submit on Gradescope. Start every question on a new page. EE376A: Homeworks #4 Solutions Due on Thursday, February 22, 28 Please submit on Gradescope. Start every question on a new page.. Maximum Differential Entropy (a) Show that among all distributions supported

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

INTRODUCTION TO INFORMATION THEORY

INTRODUCTION TO INFORMATION THEORY INTRODUCTION TO INFORMATION THEORY KRISTOFFER P. NIMARK These notes introduce the machinery of information theory which is a eld within applied mathematics. The material can be found in most textbooks

More information

Chapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code

Chapter 3 Source Coding. 3.1 An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code Chapter 3 Source Coding 3. An Introduction to Source Coding 3.2 Optimal Source Codes 3.3 Shannon-Fano Code 3.4 Huffman Code 3. An Introduction to Source Coding Entropy (in bits per symbol) implies in average

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Computing and Communications 2. Information Theory -Entropy

Computing and Communications 2. Information Theory -Entropy 1896 1920 1987 2006 Computing and Communications 2. Information Theory -Entropy Ying Cui Department of Electronic Engineering Shanghai Jiao Tong University, China 2017, Autumn 1 Outline Entropy Joint entropy

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology SE lecture revision 2013 Outline 1. Bayesian classification

More information