Lecture 11: Information Theoretic Methods. Mutual Information as Information Gain. Feature Transforms
Isabelle Guyon

Mutual Information as Information Gain (Book Chapter 6) and Feature Transforms

Information, Entropy

Setting: a data matrix X = {x_ij} of m examples x_i (rows) and n features (columns); a feature transform φ_ik = φ_k(x_i) produces the representation Φ = {φ_ik}, an m × n' matrix with n' features (n' < n).

For an event happening with probability p, the quantity of information is -log p. If the log is of base 2, the unit is a bit.

The average quantity of information over all events is the entropy:

H = - Σ_i p_i log p_i
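As a quick numeric illustration (my addition, not on the slide), the entropy formula takes two lines of MATLAB/Octave:

p = [0.5, 0.25, 0.25];      % any probability vector summing to 1
H = -sum(p .* log2(p))      % entropy in bits; here H = 1.5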
Conditional Entropy, Mutual Information

Our case: Y = class labels, Φ = feature representation.

Entropy: H(Y) = - Σ_y p(y) log p(y)
Conditional entropy: H(Y|Φ) = - ∫_f p(f) Σ_y p(y|f) log p(y|f) df
Information gain: the amount of class-uncertainty reduction obtained by observing Φ:
MI(Y, Φ) = H(Y) - H(Y|Φ) = Σ_y ∫_f p(y, f) log [ p(y, f) / (p(y) p(f)) ] df

MI Estimation

Relations between entropy and mutual information (shown on the slide as a Venn diagram of H(X), H(Y), H(X,Y), MI(X,Y), with plots of P(X), P(Y), P(X)*P(Y) and P(X,Y)):
H(X, Y) = H(X) + H(Y) - MI(X, Y)
MI(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
A possible normalization: divide MI(X, Y) by (H(X) + H(Y))/2.

Example: estimating MI from a 2x2 joint distribution P(X, Y) in MATLAB/Octave:
pxy = [4/16, 6/16; 5/16, 1/16]
px = sum(pxy)      % marginal P(X) (column sums)
py = sum(pxy, 2)   % marginal P(Y) (row sums)
pxpy = py * px     % product of marginals P(X)P(Y)
MI = sum(sum(pxy .* log2(pxy ./ pxpy)))
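Running the snippet end to end and adding the normalization mentioned above (a sketch; the numeric values in the comments were computed from this pxy, not copied from the slide):

pxy  = [4/16, 6/16; 5/16, 1/16];            % joint P(X,Y)
px   = sum(pxy, 1);                          % P(X) = [0.5625 0.4375]
py   = sum(pxy, 2);                          % P(Y) = [0.6250; 0.3750]
pxpy = py * px;                              % P(X)P(Y), a 2x2 matrix
MI   = sum(sum(pxy .* log2(pxy ./ pxpy)));   % about 0.138 bits
Hx   = -sum(px .* log2(px));                 % about 0.989 bits
Hy   = -sum(py .* log2(py));                 % about 0.954 bits
MI_normalized = MI / ((Hx + Hy) / 2)         % about 0.142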
Tree Classifiers

CART (Breiman, 1984) or C4.5 (Quinlan, 1993). At each step, choose the feature that reduces entropy most; work towards node purity.

Information gain example: all the data (19 examples: 11 of one class, 8 of the other) can be split on f1 or on f2. Choosing f2:
H_before = -11/19 log(11/19) - 8/19 log(8/19) = 0.98
H_left  = -4/11 log(4/11) - 7/11 log(7/11) = 0.94
H_right = -7/8 log(7/8) - 1/8 log(1/8) = 0.54
IG = H_before - (11/19 H_left + 8/19 H_right) = 0.2

IG = MI:
H_before = H(Y); H_left = H(Y | f2 = -1); H_right = H(Y | f2 = +1)
IG = H_before - (p(f2=-1) H_left + p(f2=+1) H_right)
   = H(Y) - (p(f2=-1) H(Y | f2=-1) + p(f2=+1) H(Y | f2=+1))
   = H(Y) - H(Y | f2)
   = MI(Y, f2)

Iris Data (Fisher, 1936; Chapter 1)

Three classes: setosa, versicolor, virginica. Decision boundaries compared: linear discriminant, tree classifier, Gaussian mixture, kernel method (SVM).
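The slide's numbers can be reproduced with a few lines of MATLAB/Octave (my addition; unrounded values in the comments):

H = @(p) -sum(p(p > 0) .* log2(p(p > 0)));           % entropy of a probability vector
H_before = H([11/19, 8/19]);                          % 0.9819
H_left   = H([4/11, 7/11]);                           % 0.9457
H_right  = H([7/8, 1/8]);                             % 0.5436
IG = H_before - (11/19 * H_left + 8/19 * H_right)     % 0.2056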
MI for Feature Selection

- Feature ranking
- Forward selection
- Feature subset selection (Markov blankets, scaling factors, shrinkage)

Random Forest (Breiman, Cutler, 2002)

Forward Selection with MI (Fleuret, 2004). Practical only for binary features:
1. Select a first feature X_s(1) with maximum MI with the target Y.
2. For each remaining feature X_i and each previously selected feature X_s(j), compute the conditional mutual information:
   CMI(X_i, X_s(j)) = Σ_{x_s(j)} P(x_s(j)) MI(Y, X_i | X_s(j) = x_s(j))
3. Select the feature with maximum CMI; iterate. (A code sketch follows this list.)

Transmission Channels

X → noisy channel → Y.
Channel capacity = max_{p(x)} MI(X, Y): the highest achievable information transmission rate without error (Shannon's second theorem).
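A minimal sketch of this forward-selection loop for binary features (my implementation, not Fleuret's code; assumptions: X is m-by-n with 0/1 entries, y is a 0/1 column vector, k features are requested; the slide leaves the aggregation over previously selected features implicit, so this follows Fleuret's CMIM rule: maximize over candidates the minimum CMI over already-selected features):

function selected = forward_select_mi(X, y, k)
  n = size(X, 2);
  remaining = 1:n;
  % First feature: maximum MI with the target.
  mi = arrayfun(@(i) binary_mi(X(:, i), y), remaining);
  [~, best] = max(mi);
  selected = remaining(best);
  remaining(best) = [];
  while numel(selected) < k && ~isempty(remaining)
    score = zeros(size(remaining));
    for a = 1:numel(remaining)
      % CMI of the candidate with y, given each previously selected feature.
      cmi = arrayfun(@(j) cond_mi(X(:, remaining(a)), y, X(:, j)), selected);
      score(a) = min(cmi);            % CMIM aggregation (see note above)
    end
    [~, best] = max(score);
    selected(end + 1) = remaining(best); %#ok<AGROW>
    remaining(best) = [];
  end
end

function I = binary_mi(a, b)
  % Empirical MI (bits) between two 0/1 vectors via the 2x2 joint.
  m = numel(a);
  p = [sum(a == 0 & b == 0), sum(a == 0 & b == 1);
       sum(a == 1 & b == 0), sum(a == 1 & b == 1)] / m;
  pp = sum(p, 2) * sum(p, 1);         % product of marginals
  nz = p > 0;
  I = sum(p(nz) .* log2(p(nz) ./ pp(nz)));
end

function I = cond_mi(a, b, c)
  % CMI(a; b | c) = sum_v P(c = v) * MI(a; b restricted to c = v).
  I = 0;
  for v = [0 1]
    idx = (c == v);
    if any(idx)
      I = I + mean(idx) * binary_mi(a(idx), b(idx));
    end
  end
end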
Coding

signal → encoder → noisy channel
Code efficiency = length(code) / length(original signal).
Theorems (Shannon, 1948): errorless transmission is achievable with sufficiently inefficient (well chosen) codes; the efficiency is bounded by the capacity. For a chosen error rate, one can find a shorter code.

MI and Error Rate

Bounds relating the classification error rate to the conditional entropy H(Y|Φ): Fano (1961), Hellman and Raviv (1970), Feder and Merhav (1994).

Noisy Channel in PR

The noisy-channel view of pattern recognition:
category ("This is a four") → encoder → ideal pattern (4) → noisy channel → real pattern → decoder → predicted category.
With feature extraction inserted: ... → real pattern → preprocessor → features Φ → decoder → predicted category.
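For reference, the bounds behind these citations as usually stated (my addition, not spelled out on the slide; the Hellman-Raviv form below takes entropy in bits and should be checked against the original paper):

\[ H(Y \mid \Phi) \;\le\; H_b(P_e) + P_e \log(K - 1) \qquad \text{(Fano, 1961; } K \text{ classes, } P_e \text{ the error rate)} \]
\[ P_e \;\le\; \tfrac{1}{2}\, H(Y \mid \Phi) \qquad \text{(Hellman and Raviv, 1970)} \]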
Information Bottleneck (Tishby et al., 1999)

X → Φ → Y. Information bottleneck methods maximize MI(Y, Φ) = MI(Φ, Y) but minimize MI(X, Φ).

Computational Tricks (1/4): Gradient Descent (Torkkola, 2001)

Pipeline: class labels y; high-dimensional data x (dim n) → feature construction/selection f(θ, x) → low-dimensional feature vector f (dim n') → posterior probability p(y|f) estimation → mutual information MI(f, y) (dim 1).
Gradient with respect to the parameters θ, by the chain rule:
∂MI/∂θ = ∂MI/∂f · ∂f/∂θ      [shapes: (1, n_θ) = (1, n') × (n', n_θ)]
p(y|f) appears in ∂MI/∂f. For example, for the linear model f(W, x) = W xᵀ, ∂f/∂W = x.
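A tiny numeric check of the linear-model Jacobian (my addition): perturbing one entry of W changes only the corresponding output component, by the matching entry of x:

x = randn(5, 1);                 % n = 5 input dimensions
W = randn(2, 5);                 % n' = 2 feature dimensions
f = W * x;                       % linear feature map
eps_ = 1e-6;
Wp = W; Wp(1, 3) = Wp(1, 3) + eps_;
df = (Wp * x - f) / eps_;        % numerical derivative of f w.r.t. W(1,3)
disp([df(1), x(3)])              % the two values agree: d f_1 / d W(1,3) = x_3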
Computational Tricks (2/4): Quadratic MI

Trick 2: replace the Shannon MI
MI(Y, Φ) = Σ_y ∫_f p(y, f) log [ p(y, f) / (p(y) p(f)) ] df
by a quadratic measure
MI²(Y, Φ) = Σ_y ∫_f (p(y, f) - p(y) p(f))² df
Justification: the Renyi entropy H(Y) = - log Σ_y p(y)².

Computational Tricks (3/4): Suitable Densities

Trick 3: use densities that are mixtures of Gaussians.
Gaussian mixture, one Gaussian per class: p(f|y) = G(f - f̄_y, Σ_y);
p(f) = Σ_y p(y) p(f|y) = Σ_y (m_y/m) G(f - f̄_y, Σ_y).
Parzen windows (Rosenblatt 1956, Parzen 1962), one kernel per example, the same σ for all:
p(f|y) = (1/m_y) Σ_{i ∈ c_y} G(f - f_i, σ)
p(f) = (1/m) Σ_{i=1:m} G(f - f_i, σ_m), with bandwidth σ_m = σ/sqrt(m).
(Figure from Duda and Hart.)
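A minimal Parzen-window estimator in MATLAB/Octave (my sketch; σ is a user-chosen bandwidth, one Gaussian kernel per example):

f_data = randn(100, 1);                                % m = 100 one-dimensional feature values
sigma  = 0.3;                                          % kernel bandwidth
G = @(u, s2) exp(-u.^2 ./ (2*s2)) ./ sqrt(2*pi*s2);    % Gaussian kernel
p = @(f) mean(G(f - f_data, sigma^2));                 % density estimate at a point f
p(0.5)                                                 % e.g., evaluate at f = 0.5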
Computational Tricks (4/4): Plugging In

Trick 4: expand the quadratic MI into three "information potential" terms:
MI²(Y, Φ) = Σ_y ∫_f (p(y, f) - p(y) p(f))² df = V_in + V_all - V_bw
V_in  = Σ_y ∫_f p(y, f)² df             = function( G(f_ci - f_cj, 2σ) )   [pairs within the same class]
V_all = Σ_y ∫_f p(y)² p(f)² df          = function( G(f_i - f_j, 2σ) )     [all pairs]
V_bw  = 2 Σ_y ∫_f p(y, f) p(y) p(f) df  = function( G(f_ci - f_j, 2σ) )    [class-to-all pairs]
using ∫ G(f - a, σ) G(f - b, σ) df = G(a - b, 2σ) for Parzen densities.

Information Forces

Maximize MI by hill climbing (gradient ascent):
θ(t+1) = θ(t) + η ∂MI/∂θ = θ(t) + η (∂MI/∂f)(∂f/∂θ) = θ(t) + η [∂V_in/∂f + ∂V_all/∂f - ∂V_bw/∂f] ∂f/∂θ
Force = - gradient(Energy), with Energy = -MI:
∂V_in/∂f: cohesion within classes; ∂V_all/∂f: overall cohesion; -∂V_bw/∂f: repulsion between classes.

Feature Construction (Torkkola, 2001)

Example 1: three classes in 3-dimensional space (artificial data).
1.a f = linear, p(y|f) = Parzen windows. 1.b f = RBF, p(y|f) = Parzen windows.
Example 2: three classes in 12-dimensional space (the oil-pipe-flow data from NCRG at Aston University).
2.a f = linear, p(y|f) = Parzen windows. 2.b f = linear, p(y|f) = Gaussian mixture.
Example 3: six classes in 36-dimensional space (the Landsat satellite image database from the Statlog project).
f = linear, p(y|f) = Parzen windows.
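A compact sketch of these potentials for 1-D Parzen features (my implementation of the formulas above; the factor 2 of V_bw is written explicitly, and the normalization constants follow the derivation here rather than Torkkola's paper; `f - f'` uses implicit expansion, so use bsxfun on old MATLAB):

m = 60;
y = [ones(30, 1); 2 * ones(30, 1)];                       % two classes
f = [randn(30, 1) - 1; randn(30, 1) + 1];                 % 1-D features
sigma = 0.5;                                              % Parzen bandwidth
G = @(u, s2) exp(-u.^2 ./ (2*s2)) ./ sqrt(2*pi*s2);
K = G(f - f', 2 * sigma^2);                               % all pairwise G(f_i - f_j, 2 sigma^2)
V_in = 0; V_bw = 0; w = 0;
for c = 1:2
  in_c = (y == c); mc = sum(in_c);
  V_in = V_in + sum(sum(K(in_c, in_c))) / m^2;            % within-class potential
  V_bw = V_bw + 2 * (mc/m) * sum(sum(K(in_c, :))) / m^2;  % class-to-all potential (x2)
  w = w + (mc/m)^2;                                       % sum_y p(y)^2
end
V_all = w * sum(K(:)) / m^2;                              % all-pairs potential
MI_quadratic = V_in + V_all - V_bw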
Application to Feature Selection

Scaling factors: Φ_i = σ_i x_i, with σ_i ≥ 0.
s* = argmax_s MI(Φ, Y), subject to Σ_i σ_i² = 1 (shear)
s* = argmin_s MI(X, Φ) - β MI(Φ, Y)

Conclusion

MI characterizes the dependency between random variables:
- feature ranking
- forward selection with CMI
MI is also an information gain:
- tree classifiers
- random forests
MI characterizes the channel capacity and the rate of distortion:
- information bottleneck methods: min MI(X, Φ) - β MI(Φ, Y)

Exercise Class: Madelon (continued)

How to review a paper? Exercises on IT methods.
Baseline model:
my_classif = svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
my_model = chain({probe(relief, {'p_num=2000', 'pval_max=0'}), standardize, my_classif})
f_max = 20
Earn 1 point with a result better than the baseline method (testBER < 7.33% or [testBER = 7.33% and feat_num < 20 (4%)]).
Earn a total of 2 points with a result testBER < 6.79% or [testBER = 6.79% and feat_num < 17 (3.4%)].
Homework

Write a review of a paper in the Feature Extraction book (the paper you presented or will present, or any paper/chapter you want).
Questions: What is the goal of a review? What is the goal of publishing?

Review Form

A. Eliminating criteria (a zero means rejection): 1. Scope; 2. Novelty; 3. Usefulness; 4. Sanity.
B. Other fundamental criteria (a low grade implies major revisions): 1. Quantity; 2. Reproducibility; 3. Demonstration; 4. Comparison.
C. Forgivable problems of contents (minor revisions may be required): 1. Completeness; 2. Take-aways; 3. Bibliography; 4. Outlook.
D. Bonus reproducibility criteria (give additional credit): 1. Data availability; 2. Code availability.
E. Message delivery (major/minor revisions may be required): 1. Readability; 2. Notations; 3. Figures; 4. Formalism; 5. Density; 6. Language.
MI and Pearson Correlation

Exercise: prove MI(X, Y) = -(1/2) log(1 - R²) for jointly Gaussian variables.
Hints:
- Entropy of a Gaussian source: H(X) = ½ log(2πe σ²).
- The solution of the one-variable least-squares regression y = ax is a = σ_xy/σ_x². Call a' the solution of x = a'y. The squared Pearson correlation coefficient is R² = a a'.
- Residual error: σ² = E(y - ax)² = (1 - R²) σ_y².
- Write Y = function(X) + N, where N is random noise; then MI(X, Y) = H(Y) - H(N). For a Gaussian channel with Gaussian source, MI(X, Y) = -½ log(σ²/σ_y²).
Conclude the proof. (A solution sketch follows the next exercise.)

Exercise (Markov Blankets)

Notation: all features X; selected features Φ; complement Φ̄; X = [Φ, Φ̄].
Definition: a Markov blanket is a subset of features Φ such that P(Y, Φ̄ | Φ) = P(Y | Φ) P(Φ̄ | Φ). We write Y ⊥ Φ̄ | Φ (Y is independent of Φ̄ given Φ).
1. If Φ is a MB, prove that P(Y | Φ) = P(Y | X) for all assignments of values to Φ̄.
2. Define DMI(Φ) = E_{p(X)} D_KL(P(Y | X) || P(Y | Φ)). Prove: DMI(Φ) = E_{p(X,Y)} log [P(Y | X) / P(Y | Φ)].
3. Define A = log [P(X, Y) / (P(X) P(Y))] and B = log [P(Φ, Y) / (P(Φ) P(Y))]. Prove: A - B = log [P(Y | X) / P(Y | Φ)].
4. Prove: Σ_{Φ} P(Φ, Y) B = Σ_{X} P(X, Y) B (summing also over Y), and conclude DMI(Φ) = MI(X, Y) - MI(Φ, Y).
5. Define an optimization problem to find a minimum MB. How can you relate it to the information bottleneck approach?
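A compact solution sketch for the correlation exercise (my derivation, assembled from the hints above):

\[
MI(X, Y) = H(Y) - H(Y \mid X)
         = \tfrac{1}{2}\log(2\pi e\,\sigma_y^2) - \tfrac{1}{2}\log(2\pi e\,\sigma^2)
         = -\tfrac{1}{2}\log\frac{\sigma^2}{\sigma_y^2}
         = -\tfrac{1}{2}\log(1 - R^2),
\]

using that for \(Y = aX + N\) with independent Gaussian noise, \(H(Y \mid X) = H(N) = \tfrac{1}{2}\log(2\pi e\,\sigma^2)\), and the residual variance satisfies \(\sigma^2 = (1 - R^2)\,\sigma_y^2\).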
Complement of Exercise

- What are the similarities and differences between the bottleneck IT methods and the Bayesian/regularized risk methods? Can you suggest hybrid approaches?
- Can you suggest forward selection and backward elimination IT methods using the scaling-factor approach?