Information, Learning and Falsification
1 Information, Learning and Falsification David Balduzzi December 17, 2011 Max Planck Institute for Intelligent Systems Tübingen, Germany
6 Three main theories of information:
Algorithmic information. Description. The information embedded in a single string depends on its shortest description.
Shannon information. Transmission. The information transmitted by symbols depends on the transmission probabilities of other symbols in an ensemble.
Statistical learning theory. Prediction. The information about the world embedded in a classifier (its expected error) depends on the complexity of the learning algorithm.
Can these be related?
Effective information. Discrimination. The information produced by a physical process when it produces an output depends on how sharply it discriminates between inputs.
7 Effective information
8 Nature decomposes into specific, bounded physical systems, which we model as deterministic functions f : X → Y or, more generally, as Markov matrices p_m(y | x), where X and Y are finite sets.
9 Physical processes discriminate between inputs. [Figure: a thermometer.]
10 Definition. The discrimination given by Markov matrix m outputting y is
\hat{p}_m(x \mid y) := \frac{p_m(y \mid \mathrm{do}(x)) \, p_{\mathrm{unif}}(x)}{p_m(y)},
where p_m(y) := \sum_x p_m(y \mid \mathrm{do}(x)) \, p_{\mathrm{unif}}(x) is the effective distribution.
Definition. Effective information is the Kullback-Leibler divergence
ei(m, y) := H\left[ \hat{p}_m(X \mid y) \,\middle\|\, p_{\mathrm{unif}}(X) \right].
Balduzzi and Tononi, PLoS Computational Biology, 2008
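The two definitions above can be sketched numerically. The toy Markov matrix and helper below are illustrative, not from the talk; `m[x][y]` plays the role of p_m(y | do(x)), and y is assumed reachable so that p_m(y) > 0.

```python
import math

def effective_info(m, y):
    """Effective information ei(m, y) in bits for Markov matrix m outputting y."""
    n = len(m)                               # |X|
    p_unif = 1.0 / n
    p_y = sum(row[y] * p_unif for row in m)  # effective distribution p_m(y)
    ei = 0.0
    for row in m:
        p_hat = row[y] * p_unif / p_y        # Bayes' rule: p̂_m(x | y)
        if p_hat > 0:
            # KL-divergence term against the uniform prior
            ei += p_hat * math.log2(p_hat / p_unif)
    return ei

# A perfectly discriminating (identity) matrix on 4 inputs:
m_sharp = [[1.0 if x == y else 0.0 for y in range(4)] for x in range(4)]
print(effective_info(m_sharp, 0))  # → 2.0, i.e. log2(4) bits
```

A constant (non-discriminating) matrix gives 0 bits, as the definition demands.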
11 Special case: deterministic f : X → Y.
Definition. The discrimination given by f outputting y assigns equal probability to all elements of the pre-image f^{-1}(y).
Definition. Effective information is
ei(f, y) := \log \frac{|X|}{|f^{-1}(y)|}.
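A minimal sketch of the deterministic special case; the thermometer-style bucketing function `f` is illustrative:

```python
import math

def ei_deterministic(f, X, y):
    """ei(f, y) = log2(|X| / |f^{-1}(y)|) for a deterministic f on finite X."""
    preimage = [x for x in X if f(x) == y]
    return math.log2(len(X) / len(preimage))

# A crude "thermometer" that buckets inputs 0..7 into two readings:
X = range(8)
f = lambda x: "hi" if x >= 4 else "lo"
print(ei_deterministic(f, X, "hi"))  # → 1.0, i.e. log2(8/4) = 1 bit
```

Sharper discrimination (smaller pre-images) yields more bits, matching the slogan on the next slide.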
12 [Figure: the discrimination performed by the thermometer. When the thermometer outputs a reading, ei = −log(|pre-image| / |X|).]
13 Algorithmic information
14 Definition. Given a universal prefix Turing machine T, the Kolmogorov complexity of a string s is
K_T(s) := \min_{\{i : T(i) = s\}} \mathrm{len}(i),
the length of the shortest program that generates s. For any universal Turing machine U ≠ T, there exists a constant c such that
K_U(s) − c ≤ K_T(s) ≤ K_U(s) + c for all s.
15 Definition. Given T, the (unnormalized) Solomonoff prior probability of string s is
p_T(s) := \sum_{\{i : T(i) = s\}} 2^{-\mathrm{len}(i)},
where the sum is over strings i that cause T to output s as a prefix, and no proper prefix of i outputs s. The Turing machine discriminates between programs according to which strings they output; the Solomonoff prior counts the programs in each class (weighted by length).
16 Kolmogorov complexity = Algorithmic probability
Theorem (Levin). For all s,
−\log p_T(s) = K_T(s)
up to an additive constant c.
Upshot: for my purposes, Solomonoff's formulation of Kolmogorov complexity is the right one:
K_T(s) := −\log p_T(s).
17 Recall, the effective distribution was the denominator when computing discriminations using Bayes' rule:
\hat{p}_m(x \mid y) := \frac{p_m(y \mid \mathrm{do}(x)) \, p_{\mathrm{unif}}(x)}{p_m(y)}.
18 Solomonoff prior ↔ Effective distribution
Proposition. The effective distribution on Y induced by f is
p_f(y) = \sum_{\{x : f(x) = y\}} 2^{-\mathrm{len}(x)}.
Compare with the Solomonoff distribution:
p_T(s) := \sum_{\{i : T(i) = s\}} 2^{-\mathrm{len}(i)}.
Compute the effective distribution by replacing the universal Turing machine T with f : X → Y, and giving inputs len(x) = log |X|, their length in the optimal code for the uniform distribution on X.
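A small sketch of the proposition: with len(x) := log2 |X|, each input contributes 2^{-len(x)} = 1/|X|, so p_f(y) reduces to |f^{-1}(y)| / |X|. The parity function below is illustrative:

```python
import math

def effective_dist(f, X, y):
    """p_f(y) = sum over f^{-1}(y) of 2^{-len(x)}, with len(x) = log2 |X|."""
    code_len = math.log2(len(X))   # uniform-code length of every input
    return sum(2 ** -code_len for x in X if f(x) == y)

X = list(range(8))
f = lambda x: x % 2                # parity: two fibres of size 4
print(effective_dist(f, X, 0))     # → 0.5, i.e. 4 * 2^{-3}
```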
19 Kolmogorov complexity ↔ Effective information
Proposition. For a function f : X → Y, effective information equals
ei(f, y) = −\log p_f(y) = −\log \sum_{\{x : f(x) = y\}} 2^{-\mathrm{len}(x)}.
Compare with Kolmogorov complexity:
K_T(s) = −\log p_T(s) = −\log \sum_{\{i : T(i) = s\}} 2^{-\mathrm{len}(i)}.
20 Statistical learning theory
21 Hypothesis space. Given unlabeled data D = (x_1, …, x_l) ∈ X^l, let the hypothesis space Σ_D = {σ : D → ±1} be the set of all possible labelings.
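Enumerating Σ_D for a toy dataset (the data points are illustrative); since each of the l points gets one of two labels, |Σ_D| = 2^l:

```python
from itertools import product

# Three unlabeled data points; Sigma_D is every ±1 labeling of them.
D = ["x1", "x2", "x3"]
Sigma_D = list(product([-1, +1], repeat=len(D)))
print(len(Sigma_D))  # → 8, i.e. 2^3
```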
22 Setup. Suppose data D = (x_1, …, x_l) is drawn from an unknown probability distribution P_X and labeled y_i = σ(x_i) by an unknown supervisor σ ∈ Σ_X. The learning problem: find a classifier \hat{f} guaranteed to perform well on future (unseen) data sampled via P_X and labeled by σ.
25 Empirical risk minimization. Suppose we are given a class F of functions to work with. A simple algorithm for tackling the learning problem is:
Algorithm: given data labeled by σ ∈ Σ_D, find the classifier \hat{f} ∈ F ⊂ Σ_D that minimizes empirical risk:
\hat{f} := \arg\min_{f \in F} \frac{1}{l} \sum_{i=1}^{l} I_{[f(x_i) \neq \sigma(x_i)]}
Key step. Reformulate the algorithm as a function between finite sets:
Empirical risk minimization:
R_{F,D} : Σ_D → R (hypothesis space → empirical risk),
σ ↦ \min_{f \in F} \frac{1}{l} \sum_{i=1}^{l} I_{[f(x_i) \neq \sigma(x_i)]}
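A toy sketch of the key step (the data and the small function class F are illustrative): empirical risk minimization as a function from labelings to a finite set of training errors.

```python
from itertools import product

D = [0, 1, 2, 3]
l = len(D)
F = [lambda x: 1,                       # constant +1
     lambda x: -1,                      # constant -1
     lambda x: 1 if x >= 2 else -1]     # a threshold at 2

def erm(sigma):
    """R_{F,D}: minimal empirical risk of any f in F against labeling sigma."""
    return min(sum(f(x) != s for x, s in zip(D, sigma)) / l for f in F)

# The image of R_{F,D}: which training errors the ERM can output.
errors = {erm(sigma) for sigma in product([-1, 1], repeat=l)}
print(sorted(errors))  # → [0.0, 0.25, 0.5]
```

The point of the reformulation: everything about the ERM is now a property of this finite map, not of the individual classifiers in F.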
26 [Figure: two function classes compared. Low capacity F_1 fits few hypotheses; high capacity F_2 fits many. Each minimizer R_1, R_2 maps the hypothesis space to training errors 0, ε_1, ε_2, ε_3.]
27 Theorem (standard template for error bounds in SLT). With probability 1 − δ,
(expected error of learner) ≤ (historical error) + (capacity of algorithm) + (confidence term).
[Figure: the capacity term governs the trade-off between underfitting and overfitting.]
28 Minimizing empirical risk, R_{F,D} : Σ_X → R, is a physical process. Questions:
Q1. What is the effective distribution ("Solomonoff prior") of the ERM?
Q2. What is the effective information ("Kolmogorov complexity") of its outputs?
29 Effective distribution ↔ Rademacher complexity
[Figure: ERM outputs 0, ε_1, ε_2, ε_3 ∈ R.]
Proposition ("Solomonoff → Rademacher"). The expectation of the ERM over the effective distribution is determined by the empirical Rademacher complexity:
\sum_{\epsilon \in R} p_{R_{F,D}}(\epsilon) \cdot \epsilon = \frac{1}{2}\left(1 − \mathrm{Rademacher}(F, D)\right)
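A numerical check of the proposition on a toy example (F and D are illustrative, with classifiers tabulated as sign vectors). The identity is exact: the expected ERM output over uniform labelings equals (1/2)(1 − Rademacher(F, D)).

```python
from itertools import product

l = 4
F = [(1, 1, 1, 1), (-1, -1, -1, -1), (-1, -1, 1, 1)]   # f(x_i) tabulated
labelings = list(product([-1, 1], repeat=l))

# Left-hand side: expected minimal empirical risk over uniform labelings.
lhs = sum(min(sum(fi != si for fi, si in zip(f, s)) / l for f in F)
          for s in labelings) / len(labelings)

# Empirical Rademacher complexity: E_sigma[ max_f (1/l) sum_i sigma_i f(x_i) ].
rad = sum(max(sum(si * fi for si, fi in zip(s, f)) / l for f in F)
          for s in labelings) / len(labelings)
rhs = 0.5 * (1 - rad)
print(lhs, rhs)  # → 0.28125 0.28125
```

The identity follows pointwise from I[f(x) ≠ σ(x)] = (1 − σ(x) f(x)) / 2, so minimizing risk is the same as maximizing correlation with the labeling.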
30 Effective information ↔ VC-entropy
Proposition ("Kolmogorov → Vapnik"). The effective information generated by the ERM when it outputs 0 is determined by the empirical VC-entropy:
ei(R_{F,D}, 0) = −\log p_{R_{F,D}}(0) = l − \mathrm{VC\text{-}entropy}(F, D),
where l is the amount of training data.
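A toy check, reading empirical VC-entropy as log2 of the number of distinct dichotomies F realizes on D (the set F_D below is illustrative). A uniform labeling is fit with zero risk exactly when it coincides with one of those dichotomies, so p(0) = |F_D| / 2^l and ei = l − log2 |F_D|.

```python
import math

l = 4
F_D = {(1, 1, 1, 1), (-1, -1, -1, -1), (-1, -1, 1, 1)}  # dichotomies of F on D

p0 = len(F_D) / 2 ** l        # chance a uniform labeling is fit with zero risk
ei = -math.log2(p0)
print(ei, l - math.log2(len(F_D)))  # both ≈ 2.415, i.e. 4 - log2(3)
```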
31 Corollary (reformulation of error bounds in SLT). With probability 1 − δ,
(expected error of ERM output) ≤ (historical error of ERM output) + (discrimination of inputs by ERM) + (confidence term).
[Figure: the ERM maps hypotheses to errors 0, ε_1, ε_2, ε_3; how the ERM discriminates inputs determines what the ERM outputs.]
32 Falsification
34 Karl Popper wanted to justify scientific knowledge. He was very impressed by Einstein's bold conjecture that the Sun's gravitational field bends starlight, which, when proved correct, overthrew Newtonian physics despite an enormous body of evidence in favor of Newton.
Popper's big idea: rely on theories that have been severely tested, rather than theories supported by lots of facts. Unfortunately, Popper failed to justify his big idea.
35 Counting falsified hypotheses: Rademacher complexity.
\sum_{\epsilon \in R} p_{R_{F,D}}(\epsilon) \cdot \epsilon = \frac{1}{2}\left(1 − \mathrm{Rademacher}(F, D)\right)
Here p_{R_{F,D}}(\epsilon) is the fraction of hypotheses the ERM falsifies on fraction ε of the data, so the left-hand side is a weighted count of falsified hypotheses.
36 Counting falsified hypotheses: VC-entropy.
ei(R_{F,D}, 0) = l − \mathrm{VC\text{-}entropy}(F, D)
ei(R_{F,D}, 0) = \log|\Sigma_X| − \log|R_{F,D}^{-1}(0)|
= (log of total # hypotheses) − (log of # hypotheses the ERM fits):
a logarithmic count of falsified hypotheses.
38 Back to Popper and justifying scientific knowledge. A minimal model of Popper's question: when can we trust generalizations based on training error? Answer: if the empirical risk minimizer has small capacity.
ERM has small capacity ⟺ ERM falsifies many hypotheses.
39 Conclusion
41 Philosophy. A major theme of 20th-century mathematics was the transition from set theory (a language for talking about points = elements) to category theory (a language for talking about arrows = functions).
This talk substituted thinking about sets (e.g. the function class F ⊂ Σ_X) with thinking about the structure of the arrow ERM : Σ_X → R from hypothesis space to training errors. Immediate consequences:
1. SLT ↔ algorithmic information theory
2. SLT ↔ falsification
42 Conclusion
Physical processes discriminate between inputs.
Effective information is a non-universal analog of Kolmogorov complexity: replace the universal Turing machine with a finite function.
The information generated while minimizing empirical risk (1) controls error bounds (SLT) and (2) can be expressed in terms of the number of falsified hypotheses.
Conjecture: effective information generated by optimizations other than ERM also controls future performance.
43 Thank you!
Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating
More informationIntroduction to Kleene Algebras
Introduction to Kleene Algebras Riccardo Pucella Basic Notions Seminar December 1, 2005 Introduction to Kleene Algebras p.1 Idempotent Semirings An idempotent semiring is a structure S = (S, +,, 1, 0)
More informationORIE 4741: Learning with Big Messy Data. Generalization
ORIE 4741: Learning with Big Messy Data Generalization Professor Udell Operations Research and Information Engineering Cornell September 23, 2017 1 / 21 Announcements midterm 10/5 makeup exam 10/2, by
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationStatistical and Inductive Inference by Minimum Message Length
C.S. Wallace Statistical and Inductive Inference by Minimum Message Length With 22 Figures Springer Contents Preface 1. Inductive Inference 1 1.1 Introduction 1 1.2 Inductive Inference 5 1.3 The Demise
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationLearning Theory. Piyush Rai. CS5350/6350: Machine Learning. September 27, (CS5350/6350) Learning Theory September 27, / 14
Learning Theory Piyush Rai CS5350/6350: Machine Learning September 27, 2011 (CS5350/6350) Learning Theory September 27, 2011 1 / 14 Why Learning Theory? We want to have theoretical guarantees about our
More informationTopics in Natural Language Processing
Topics in Natural Language Processing Shay Cohen Institute for Language, Cognition and Computation University of Edinburgh Lecture 9 Administrativia Next class will be a summary Please email me questions
More informationInterpreting Deep Classifiers
Ruprecht-Karls-University Heidelberg Faculty of Mathematics and Computer Science Seminar: Explainable Machine Learning Interpreting Deep Classifiers by Visual Distillation of Dark Knowledge Author: Daniela
More informationRegret Analysis for Performance Metrics in Multi-Label Classification The Case of Hamming and Subset Zero-One Loss
Regret Analysis for Performance Metrics in Multi-Label Classification The Case of Hamming and Subset Zero-One Loss Krzysztof Dembczyński 1, Willem Waegeman 2, Weiwei Cheng 1, and Eyke Hüllermeier 1 1 Knowledge
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationGradient Boosting, Continued
Gradient Boosting, Continued David Rosenberg New York University December 26, 2016 David Rosenberg (New York University) DS-GA 1003 December 26, 2016 1 / 16 Review: Gradient Boosting Review: Gradient Boosting
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More information