Conditional Likelihood Maximization: A Unifying Framework for Information Theoretic Feature Selection

Gavin Brown, Adam Pocock, Mingjie Zhao and Mikel Luján
School of Computer Science, University of Manchester
Presented by Wenzhao Lian, July 27, 2012

Outline
1. Main Contribution
2. Background
3. Main Work
4. Experiments
5. Conclusion

Main Contribution
Feature selection problem: select the feature set that is most relevant and least redundant.
What is the selection criterion? Existing criteria define scoring functions that measure the relevancy and redundancy of features.
In this paper:
- A scoring function is derived rather than defined.
- A unifying framework for information theoretic feature selection is proposed.
- The general criterion recovers existing criteria under different assumptions.

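Background
Entropy and Mutual Information
$$
\begin{aligned}
H(X) &= -\sum_{x \in X} p(x)\log p(x) \\
H(X \mid Y) &= -\sum_{y \in Y} p(y) \sum_{x \in X} p(x \mid y)\log p(x \mid y) \\
I(X;Y) &= H(X) - H(X \mid Y) = \sum_{x \in X}\sum_{y \in Y} p(xy)\log\frac{p(xy)}{p(x)\,p(y)} \\
I(X;Y \mid Z) &= H(X \mid Z) - H(X \mid YZ) = \sum_{z \in Z} p(z) \sum_{x \in X}\sum_{y \in Y} p(xy \mid z)\log\frac{p(xy \mid z)}{p(x \mid z)\,p(y \mid z)}
\end{aligned}
\tag{1}
$$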
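
The definitions above can be estimated from discrete samples by plugging in empirical frequencies. The following is a minimal sketch, not code from the slides: the function names and the toy XOR data are made up for illustration, and it uses the identity H(X|Y) = H(X,Y) - H(Y), which is equivalent to the definition of conditional entropy given above.

```python
import math
from collections import Counter


def entropy(xs):
    """Plug-in estimate of H(X) in bits from a list of discrete samples."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X|Y) = H(X,Y) - H(Y), using empirical frequencies."""
    return entropy(list(zip(xs, ys))) - entropy(ys)


def mutual_information(xs, ys):
    """I(X;Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)


def conditional_mutual_information(xs, ys, zs):
    """I(X;Y|Z) = H(X|Z) - H(X|YZ)."""
    yz = list(zip(ys, zs))
    return conditional_entropy(xs, zs) - conditional_entropy(xs, yz)


# Toy data (hypothetical): y is the XOR of two fair bits x1 and x2.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

# x1 alone carries no information about y, but it does once x2 is known.
print(mutual_information(x1, y))
print(conditional_mutual_information(x1, y, x2))
```

On this toy data the marginal mutual information is zero while the conditional one is a full bit, which is the kind of complementarity the later criteria try to capture. Plug-in estimates are biased for small samples, so this sketch is only suitable for illustration.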

Previous Feature Selection Criteria
Mutual Information Maximization (MIM)
$$J_{mim}(X_k) = I(X_k; Y) \tag{2}$$
$J_{mim}$: relevance index. $X_k$: the $k$-th feature. $Y$: class label.
Mutual Information Feature Selection (MIFS)
$$J_{mifs}(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) \tag{3}$$
$J_{mifs}$: relevance index. $S$: set of currently selected features. $\beta$ controls the redundancy penalty.
Joint Mutual Information (JMI)
$$J_{jmi}(X_k) = \sum_{X_j \in S} I(X_k X_j; Y) \tag{4}$$
JMI indicates that a candidate feature which is complementary to the already selected features should be included.
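
The three scores above can be sketched in a few lines once a mutual information estimator is available. This is an illustrative sketch under stated assumptions: features are lists of discrete values, mutual information is the plug-in estimate, and the helper names (`mi`, `j_mim`, `j_mifs`, `j_jmi`) and the toy data are hypothetical rather than taken from the slides.

```python
import math
from collections import Counter


def entropy(xs):
    """Plug-in entropy estimate (bits) of a list of hashable samples."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mi(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def j_mim(xk, y):
    """Equation (2): relevance only."""
    return mi(xk, y)


def j_mifs(xk, y, selected, beta=1.0):
    """Equation (3): relevance minus beta times summed redundancy."""
    return mi(xk, y) - beta * sum(mi(xk, xj) for xj in selected)


def j_jmi(xk, y, selected):
    """Equation (4): the pair (X_k, X_j) is treated as one joint variable."""
    return sum(mi(list(zip(xk, xj)), y) for xj in selected)


# Toy data (hypothetical): y = a XOR b; a_copy duplicates a.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
a_copy = list(a)
y = [u ^ v for u, v in zip(a, b)]
selected = [a]  # pretend feature a has already been chosen

for name, xk in (("b", b), ("a_copy", a_copy)):
    print(name, j_mim(xk, y), j_mifs(xk, y, selected), j_jmi(xk, y, selected))
```

In this example MIM cannot distinguish the two candidates, MIFS penalises the duplicate, and JMI rewards `b` because it is complementary to the already selected feature.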

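Conditional Likelihood Problem
Data: $D = \{x^i, y^i;\ i = 1..N\}$, with $x^i = [x^i_1, x^i_2, \ldots, x^i_d]^T$.
Feature split: $x = \{x_\theta, x_{\tilde\theta}\}$ (selected and unselected features).
$\tau$: parameters used to predict $y$.
The conditional log likelihood of the labels given parameters $\theta, \tau$ is
$$\ell = \frac{1}{N}\sum_{i=1}^{N} \log q(y^i \mid x^i_\theta, \tau) \tag{5}$$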

Conditional Likelihood Problem
Introduce $p(y \mid x_\theta)$ and $p(y \mid x)$: the true distributions of the class labels given the selected features $x_\theta$ and given all features, respectively.
$$\ell = \frac{1}{N}\sum_{i=1}^{N} \log\frac{q(y^i \mid x^i_\theta, \tau)}{p(y^i \mid x^i_\theta)} + \frac{1}{N}\sum_{i=1}^{N} \log\frac{p(y^i \mid x^i_\theta)}{p(y^i \mid x^i)} + \frac{1}{N}\sum_{i=1}^{N} \log p(y^i \mid x^i) \tag{6}$$
Taking the limit $N \to \infty$, maximizing $\ell$ becomes minimizing
$$-\ell = E_{xy}\left\{\log\frac{p(y \mid x_\theta)}{q(y \mid x_\theta, \tau)}\right\} + I(X_{\tilde\theta}; Y \mid X_\theta) + H(Y \mid X) \tag{7}$$
The first term depends on the model. The final term gives a lower bound on the Bayes error and does not depend on the selected features.
Under the filter assumption, which treats optimizing the feature set and optimizing the classifier as two independent stages, we can minimize the second term without considering the first.
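
A short worked step, sketched here rather than taken from the slides, shows why the middle term of (6) turns into a conditional mutual information in the limit. It assumes the sample average converges to the expectation under the true joint distribution and uses $x = \{x_\theta, x_{\tilde\theta}\}$, so that $p(y \mid x) = p(x_{\tilde\theta}, y \mid x_\theta) / p(x_{\tilde\theta} \mid x_\theta)$.

```latex
\begin{aligned}
-\frac{1}{N}\sum_{i=1}^{N} \log\frac{p(y^i \mid x^i_\theta)}{p(y^i \mid x^i)}
  &\;\longrightarrow\; E_{xy}\left\{\log\frac{p(y \mid x)}{p(y \mid x_\theta)}\right\} \\
  &= E_{xy}\left\{\log\frac{p(x_{\tilde\theta}, y \mid x_\theta)}
       {p(x_{\tilde\theta} \mid x_\theta)\, p(y \mid x_\theta)}\right\}
   = I(X_{\tilde\theta}; Y \mid X_\theta), \\
-\frac{1}{N}\sum_{i=1}^{N} \log p(y^i \mid x^i)
  &\;\longrightarrow\; -E_{xy}\left\{\log p(y \mid x)\right\} = H(Y \mid X).
\end{aligned}
```

The second line is the definition of conditional mutual information from the Background slide with $X_{\tilde\theta}$, $Y$ and $X_\theta$ in the roles of $X$, $Y$ and $Z$.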

Conditional Likelihood Problem
For the second term, the chain rule $I(X; Y) = I(X_\theta; Y) + I(X_{\tilde\theta}; Y \mid X_\theta)$ gives
$$I(X_{\tilde\theta}; Y \mid X_\theta) = I(X; Y) - I(X_\theta; Y) \tag{8}$$
Since $I(X; Y)$ does not depend on the selection, minimizing $I(X_{\tilde\theta}; Y \mid X_\theta)$ is equivalent to maximizing $I(X_\theta; Y)$.
Using the greedy approach:
1. Initialize the selected set as the empty set.
2. At each step, select the feature with the highest score.
3. Repeat step 2 until a stopping criterion is reached.
With $S$ the currently selected set, the score for a feature $X_k$ is
$$J_{cmi}(X_k) = I(X_k; Y \mid S) \tag{9}$$
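
The greedy procedure with the score in (9) can be sketched as follows. This is a minimal illustration under assumptions that are not in the slides: features are discrete, conditional mutual information is a plug-in estimate, the stopping criterion is a fixed number of features, and the function names and toy data are hypothetical. Conditioning on the whole set S becomes unreliable as S grows, which is part of the motivation for the low-order approximations discussed on the following slides.

```python
import math
from collections import Counter


def entropy(xs):
    """Plug-in entropy estimate (bits) of a list of hashable samples."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def cmi(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))


def greedy_select(features, y, num_to_select):
    """Forward selection; features maps a name to a list of discrete values."""
    n = len(y)
    selected = []
    while len(selected) < min(num_to_select, len(features)):
        # Joint state of the selected set S per sample (empty tuple at the start).
        s = [tuple(features[name][i] for name in selected) for i in range(n)]
        candidates = [name for name in features if name not in selected]
        # Score every remaining candidate with J_cmi(X_k) = I(X_k; Y | S).
        best = max(candidates, key=lambda name: cmi(features[name], y, s))
        selected.append(best)
    return selected


# Toy data (hypothetical): y is 0 when a == 0, otherwise 1 + b; c is irrelevant.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
features = {
    "a": [r[0] for r in rows],
    "a_dup": [r[0] for r in rows],  # exact duplicate of a, fully redundant
    "b": [r[1] for r in rows],
    "c": [r[2] for r in rows],
}
y = [0 if a == 0 else 1 + b for a, b, c in rows]

print(greedy_select(features, y, 2))
```

After `a` is chosen, its duplicate scores zero because it adds nothing once `a` is known, so the second pick is `b`.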

Unifying criterion
To bring the score functions proposed in previous work into this framework, three assumptions are needed.
Assumption 1: For all unselected features $X_k \in X_{\tilde\theta}$, assume
$$p(x_\theta \mid x_k) = \prod_{j \in S} p(x_j \mid x_k), \qquad p(x_\theta \mid x_k y) = \prod_{j \in S} p(x_j \mid x_k y) \tag{10}$$
Under Assumption 1, an equivalent criterion can be written as
$$J_{cmi}(X_k) = I(X_k; Y) - \sum_{j \in S} I(X_j; X_k) + \sum_{j \in S} I(X_j; X_k \mid Y) \tag{11}$$

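Unifying criterion
Assumption 2: For all features, assume
$$p(x_i x_j \mid y) = p(x_i \mid y)\,p(x_j \mid y) \tag{12}$$
Assumption 3: For all features, assume
$$p(x_i x_j) = p(x_i)\,p(x_j) \tag{13}$$
Depending on how strongly Assumptions 2 and 3 are believed, different criteria are obtained:
$$
\begin{aligned}
J_{mim}(X_k) &= I(X_k; Y) \\
J_{mifs}(X_k) &= I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) \\
J_{mrmr}(X_k) &= I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j) \\
J_{jmi}(X_k) &= I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} \left[ I(X_k; X_j) - I(X_k; X_j \mid Y) \right]
\end{aligned}
\tag{14}
$$
The JMI form here follows from (4) by dividing by $|S|$ and dropping terms that do not depend on $X_k$.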

Unifying criterion
A general form of the unifying criterion:
$$J_{cmi}(X_k) = I(X_k; Y) - \beta \sum_{j \in S} I(X_j; X_k) + \gamma \sum_{j \in S} I(X_j; X_k \mid Y) \tag{15}$$
Figure: The full space of linear criteria
Figure 2: The full space of linear filter criteria, describing several examples from Table 1. Note that all criteria in this space adopt Assumption 1. Additionally, the γ and β axes represent the criteria belief in Assumptions 2 and 3, respectively.
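
The general form with coefficients β and γ can be sketched as one scoring function, with the criteria listed earlier recovered by choosing the coefficient pair. This is an illustrative sketch under assumptions: discrete features, plug-in estimates, and hypothetical helper names (`j_linear`, `coefficients`) and toy data that do not come from the slides.

```python
import math
from collections import Counter


def entropy(xs):
    """Plug-in entropy estimate (bits) of a list of hashable samples."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mi(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def cmi(xs, ys, zs):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))


def j_linear(xk, y, selected, beta, gamma):
    """Relevance - beta * redundancy + gamma * conditional redundancy."""
    relevance = mi(xk, y)
    redundancy = sum(mi(xj, xk) for xj in selected)
    conditional_redundancy = sum(cmi(xj, xk, y) for xj in selected)
    return relevance - beta * redundancy + gamma * conditional_redundancy


def coefficients(name, num_selected, beta=1.0):
    """(beta, gamma) pairs for the criteria named on the slides."""
    inv = 1.0 / num_selected if num_selected else 0.0
    return {
        "mim": (0.0, 0.0),
        "mifs": (beta, 0.0),
        "mrmr": (inv, 0.0),
        "jmi": (inv, inv),
    }[name]


# Toy data (hypothetical): y = a XOR b; a_copy duplicates a.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
a_copy = list(a)
y = [u ^ v for u, v in zip(a, b)]
selected = [a]

for name in ("mim", "mifs", "mrmr", "jmi"):
    bg = coefficients(name, len(selected))
    print(name, j_linear(b, y, selected, *bg), j_linear(a_copy, y, selected, *bg))
```

Only the setting with a non-zero γ separates the complementary candidate `b` from the redundant `a_copy` in the right direction on this toy problem, since the conditional redundancy term is what rewards complementarity.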

Criteria
Criteria:
- Stability or consistency
- Similarity between different methods
- Performance in limited and extreme small-sample situations
- Stability and accuracy tradeoff
Classifier: a nearest neighbour classifier (k=3) is used.

Stability
Figure: Stability Comparison
Figure 3: Kuncheva's Stability Index across 15 data sets. The box indicates the upper/lower quartiles, the horizontal line within each shows the median value, while the dotted crossbars indicate the maximum/minimum values. For convenience of interpretation, criteria on the x-axis are ordered by their median value.
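
The slides do not spell out how the stability index is computed. As a hedged sketch, the commonly cited form of Kuncheva's consistency index for two equal-size subsets A and B of n features is (r·n − k²) / (k·(n − k)), with k = |A| = |B| and r = |A ∩ B|, and the stability index is its average over all pairs of subsets. The function names and example subsets below are hypothetical.

```python
from itertools import combinations


def kuncheva_consistency(a, b, n_features):
    """Consistency of two equal-size feature subsets, corrected for chance."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("subsets must have the same size")
    if k == 0 or k >= n_features:
        raise ValueError("index is undefined for k = 0 or k = n_features")
    r = len(a & b)  # number of features the two subsets share
    return (r * n_features - k * k) / (k * (n_features - k))


def kuncheva_stability(subsets, n_features):
    """Average pairwise consistency over a collection of selected subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets")
    return sum(kuncheva_consistency(a, b, n_features) for a, b in pairs) / len(pairs)


# Hypothetical subsets, e.g. selected on three bootstrap samples of one data set.
runs = [{0, 1, 2}, {0, 1, 3}, {0, 1, 2}]
print(kuncheva_stability(runs, n_features=10))
```

The index equals 1 for identical subsets and is close to 0 when the overlap is what random selection would produce, which makes values comparable across subset sizes.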

Similarity
Figure: Relations between feature sets generated by different criteria. 2-D visualisation generated by classical multi-dimensional scaling. Panels: (a) Kuncheva's Consistency Index; (b) Yu et al.'s ...

Limited and Extreme Small-sample
Figure: Limited and Extreme Small-sample

Stability Accuracy Tradeoff
Figure: Stability Accuracy Tradeoff

Conclusion
- Presents a unifying framework for information theoretic feature selection via optimization of the conditional likelihood.
- Clarifies the implicit assumptions made when using different feature selection criteria.
- Conducts an empirical study of 9 heuristic mutual information criteria across data sets to analyze their properties.


More information

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation

Intro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor

More information

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian

More information

Capacity of a channel Shannon s second theorem. Information Theory 1/33

Capacity of a channel Shannon s second theorem. Information Theory 1/33 Capacity of a channel Shannon s second theorem Information Theory 1/33 Outline 1. Memoryless channels, examples ; 2. Capacity ; 3. Symmetric channels ; 4. Channel Coding ; 5. Shannon s second theorem,

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

Batch Mode Sparse Active Learning. Lixin Shi, Yuhang Zhao Tsinghua University

Batch Mode Sparse Active Learning. Lixin Shi, Yuhang Zhao Tsinghua University Batch Mode Sparse Active Learning Lixin Shi, Yuhang Zhao Tsinghua University Our work Propose an unified framework of batch mode active learning Instantiate the framework using classifiers based on sparse

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

Sequential Supervised Learning

Sequential Supervised Learning Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of

More information

Feature Engineering. Knowledge Discovery and Data Mining 1. Roman Kern. ISDS, TU Graz

Feature Engineering. Knowledge Discovery and Data Mining 1. Roman Kern. ISDS, TU Graz Feature Engineering Knowledge Discovery and Data Mining 1 Roman Kern ISDS, TU Graz 2017-11-09 Roman Kern (ISDS, TU Graz) Feature Engineering 2017-11-09 1 / 66 Big picture: KDDM Probability Theory Linear

More information

Learning Bayesian belief networks

Learning Bayesian belief networks Lecture 4 Learning Bayesian belief networks Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Administration Midterm: Monday, March 7, 2003 In class Closed book Material covered by Wednesday, March

More information

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9: Useful for modeling/making predictions on sequential data E.g., biological sequences, text, series of sounds/spoken words Will return to graphical models that are generative sscott@cse.unl.edu 1 / 27 2

More information

Why Language Models and Inverse Document Frequency for Information Retrieval?

Why Language Models and Inverse Document Frequency for Information Retrieval? Why Language Models and Inverse Document Frequency for Information Retrieval? Catarina Moreira, Andreas Wichert Instituto Superior Técnico, INESC-ID Av. Professor Cavaco Silva, 2744-016 Porto Salvo, Portugal

More information

CS446: Machine Learning Fall Final Exam. December 6 th, 2016

CS446: Machine Learning Fall Final Exam. December 6 th, 2016 CS446: Machine Learning Fall 2016 Final Exam December 6 th, 2016 This is a closed book exam. Everything you need in order to solve the problems is supplied in the body of this exam. This exam booklet contains

More information

Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012) Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation

More information

Conditional Independence and Factorization

Conditional Independence and Factorization Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Linear Classification

Linear Classification Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015 Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative

More information

HYPOTHESIS TESTING AND FEATURE SELECTION IN SEMI-SUPERVISED DATA

HYPOTHESIS TESTING AND FEATURE SELECTION IN SEMI-SUPERVISED DATA HYPOTHESIS TESTING AND FEATURE SELECTION IN SEMI-SUPERVISED DATA A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences

More information