Topic III.2: Maximum Entropy Models


Topic III.2: Maximum Entropy Models Discrete Topics in Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2012/13 T III.2-1

Topic III.2: Maximum Entropy Models 1. The Maximum Entropy Principle 1.1. Maximum Entropy Distributions 1.2. Lagrange Multipliers 2. MaxEnt Models for Tiling 2.1. The Distribution for Constraints on Margins 2.2. Using the MaxEnt Model 2.3. Noisy Tiles 3. MaxEnt Models for Real-Valued Data T III.2-2

The Maximum-Entropy Principle Goal: To define a distribution over data that satisfies given constraints Row/column sums Distribution of values Given such a distribution We can sample from it (as with swap randomization) We can compute the likelihood of the observed data We can compute how surprising our findings are given the distribution De Bie 2010 T III.2-3

Maximum Entropy We expect the constraints to be linear: if $x \in X$ is one data set, $\Pr(x)$ is the distribution, and $f_i(x)$ is a real-valued function of the data, the constraints are of type $\sum_x \Pr(x) f_i(x) = d_i$ Many distributions can satisfy the constraints; which to choose? We want to select the distribution that maximizes the entropy and satisfies the constraints Entropy of a discrete distribution: $-\sum_x \Pr(x)\log\Pr(x)$ T III.2-4

Why Maximize the Entropy? No other assumptions Any distribution with less-than-maximal entropy must have some reason for the reduced entropy Essentially, a latent assumption about the distribution We want to avoid these Optimal worst-case behaviour w.r.t. coding lengths If we build an encoding based on the maximum entropy distribution, the worst-case expected encoding length is the minimum over any distribution T III.2-5

Finding the MaxEnt Distribution Finding the MaxEnt distribution is a convex program with linear constraints: $\max_{\Pr}\; -\sum_x \Pr(x)\log\Pr(x)$ s.t. $\sum_x \Pr(x) f_i(x) = d_i$ for all $i$ and $\sum_x \Pr(x) = 1$ Can be solved, e.g., using Lagrange multipliers T III.2-6
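
As a sanity check, here is a minimal numerical sketch of this convex program, solved directly with scipy's SLSQP on a made-up four-element domain with a single feature f1 and target d1 = 0.8 (all concrete values are assumptions for illustration, not from the slides).

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance: |X| = 4 outcomes, one feature f1 with target expectation d1.
f1 = np.array([0.0, 1.0, 1.0, 2.0])   # hypothetical feature values f1(x)
d1 = 0.8                              # hypothetical constraint target

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)       # avoid log(0)
    return np.sum(p * np.log(p))      # minimizing this maximizes the entropy

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},      # sum_x Pr(x) = 1
    {"type": "eq", "fun": lambda p: np.sum(p * f1) - d1},  # sum_x Pr(x) f1(x) = d1
]
res = minimize(neg_entropy, x0=np.full(4, 0.25), bounds=[(0.0, 1.0)] * 4,
               method="SLSQP", constraints=constraints)
print(res.x)   # the MaxEnt distribution under these constraints
```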

Intermezzo: Lagrange multipliers A method to find extrema of constrained functions via differentiation Problem: minimize $f(x)$ subject to $g(x) = 0$ Without the constraint we could just differentiate $f(x)$, but the extrema we obtain might be infeasible given the constraint Solution: introduce a Lagrange multiplier $\lambda$ and minimize $L(x, \lambda) = f(x) - \lambda g(x)$ $\nabla f(x) - \lambda \nabla g(x) = 0$ $\partial L/\partial x_i = \partial f/\partial x_i - \lambda\,\partial g/\partial x_i = 0$ for all $i$ $\partial L/\partial \lambda = -g(x) = 0$ (the constraint!) T III.2-7

More on Lagrange multipliers For many constraints, we add one multiplier per constraint: $L(x,\lambda) = f(x) - \sum_j \lambda_j g_j(x)$ The function $L$ is known as the Lagrangian Minimizing the unconstrained Lagrangian equals minimizing the constrained $f$ But not all solutions to $\nabla f(x) - \sum_j \lambda_j \nabla g_j(x) = 0$ are extrema The solution is on the boundary of the constraint only if $\lambda_j \neq 0$ T III.2-8

Example minimize $f(x,y) = x^2 y$ subject to $g(x,y) = x^2 + y^2 = 3$ $L(x,y,\lambda) = x^2 y + \lambda(x^2 + y^2 - 3)$ $\partial L/\partial x = 2xy + 2\lambda x = 0$ $\partial L/\partial y = x^2 + 2\lambda y = 0$ $\partial L/\partial \lambda = x^2 + y^2 - 3 = 0$ Solution: $x = \pm\sqrt{2}$, $y = -1$ T III.2-9
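
A quick way to verify this example is to let a CAS solve the stationarity conditions; a sketch using sympy (not part of the original slides):

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)
f = x**2 * y
L = f + lam * (x**2 + y**2 - 3)        # Lagrangian of the example

# Solve dL/dx = dL/dy = dL/dlambda = 0
stationary = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)],
                      [x, y, lam], dict=True)
for s in stationary:
    print(s, "f =", f.subs(s))
# Among the stationary points, the minimum f = -2 is attained at x = ±sqrt(2), y = -1.
```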

Solving the MaxEnt The Lagrangian is $L(\Pr, \mu, \lambda) = -\sum_x \Pr(x)\log\Pr(x) + \sum_i \lambda_i\bigl(\sum_x \Pr(x) f_i(x) - d_i\bigr) + \mu\bigl(\sum_x \Pr(x) - 1\bigr)$ Setting the derivative w.r.t. $\Pr(x)$ to 0 gives $\Pr(x) = \frac{1}{Z(\lambda)}\exp\bigl(\sum_i \lambda_i f_i(x)\bigr)$, where $Z(\lambda) = \sum_x \exp\bigl(\sum_i \lambda_i f_i(x)\bigr)$ is called the partition function T III.2-10

The Dual and the Solution Substituting this $\Pr(x)$ into the Lagrangian yields the dual objective $L(\lambda) = \log Z(\lambda) - \sum_i \lambda_i d_i$ Minimizing the dual gives the maximal solution to the original constrained problem The dual is convex, and can therefore be minimized using well-known methods T III.2-11
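
The same toy instance as in the primal sketch above can be solved through the dual; minimizing $\log Z(\lambda) - \sum_i \lambda_i d_i$ with scipy recovers the same distribution (again, the feature values and target are assumptions for illustration).

```python
import numpy as np
from scipy.optimize import minimize

F = np.array([[0.0, 1.0, 1.0, 2.0]])   # F[i, x] = f_i(x), same toy feature as above
d = np.array([0.8])                    # constraint targets d_i

def dual(lambdas):
    scores = lambdas @ F                       # sum_i lambda_i f_i(x) for each x
    log_Z = np.log(np.sum(np.exp(scores)))     # log partition function
    return log_Z - lambdas @ d

lambdas = minimize(dual, x0=np.zeros(1)).x
p = np.exp(lambdas @ F)
p /= p.sum()                                   # Pr(x) = exp(lambda . f(x)) / Z
print(p, p @ F[0])                             # E[f1] should be close to d1 = 0.8
```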

Using the MaxEnt Distribution p-values: we can sample from the distribution and rerun the algorithm, as with swap randomization Self-information: the negative log-probability of the observed pattern under the MaxEnt model is its self-information The higher, the more information the pattern contains Information compression ratio: more complex patterns are harder to communicate (longer description length); contrasting the description length with the self-information gives us the information compression ratio T III.2-12

MaxEnt Models for Tiling The Tiling problem: binary data, aim to find fully monochromatic submatrices Constraints: the expected row and column margins $\sum_{D \in \{0,1\}^{n\times m}} \Pr(D) \sum_{j=1}^{m} d_{ij} = r_i$ for all rows $i$ and $\sum_{D \in \{0,1\}^{n\times m}} \Pr(D) \sum_{i=1}^{n} d_{ij} = c_j$ for all columns $j$ Note that these are in the correct form De Bie 2010 T III.2-13

The MaxEnt Distribution Using the Lagrangian, we can solve for $\Pr(D)$: $\Pr(D) = \prod_{i,j} \frac{1}{Z(\lambda^r_i, \lambda^c_j)} \exp\bigl(d_{ij}(\lambda^r_i + \lambda^c_j)\bigr)$, where $Z(\lambda^r_i, \lambda^c_j) = \sum_{d_{ij}\in\{0,1\}} \exp\bigl(d_{ij}(\lambda^r_i + \lambda^c_j)\bigr)$ Note that $\Pr(D)$ is a product of independent elements We did not enforce this independence; it is a consequence of the MaxEnt model Also, each element is Bernoulli distributed with success probability $\exp(\lambda^r_i + \lambda^c_j)/\bigl(1 + \exp(\lambda^r_i + \lambda^c_j)\bigr)$ T III.2-14
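
A minimal sketch of this product-Bernoulli form with made-up multipliers (fitted λ values would be plugged in the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
l_r = np.array([-0.5, 0.0, 0.7])           # hypothetical row multipliers lambda^r_i
l_c = np.array([-1.0, 0.2, 0.2, 1.0])      # hypothetical column multipliers lambda^c_j

# Success probability exp(l_r + l_c) / (1 + exp(l_r + l_c)), i.e. the logistic function
P = 1.0 / (1.0 + np.exp(-(l_r[:, None] + l_c[None, :])))
D = (rng.random(P.shape) < P).astype(int)  # one sample from the MaxEnt model
print(P.round(2))
print(D)
```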

Other Domains If our data contains nonnegative integers, the distribution changes to the geometric distribution with success probability $1 - \exp(\lambda^r_i + \lambda^c_j)$ If our data contains nonnegative real numbers, the partition function becomes $Z(\lambda^r_i, \lambda^c_j) = \int_0^{\infty} \exp\bigl(x(\lambda^r_i + \lambda^c_j)\bigr)\,dx = \frac{-1}{\lambda^r_i + \lambda^c_j}$, assuming $\lambda^r_i + \lambda^c_j < 0$ The distribution of $d_{ij}$ is then the exponential distribution with rate parameter $-(\lambda^r_i + \lambda^c_j)$ Note: a continuous distribution T III.2-15
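
For the real-valued case, sampling is equally direct; a sketch with made-up multipliers whose pairwise sums are all negative, as the model requires:

```python
import numpy as np

rng = np.random.default_rng(0)
l_r = np.array([-1.5, -1.0])                # hypothetical row multipliers
l_c = np.array([-0.5, -2.0, -1.0])          # hypothetical column multipliers

rates = -(l_r[:, None] + l_c[None, :])      # rate parameter -(lambda^r_i + lambda^c_j) > 0
D = rng.exponential(scale=1.0 / rates)      # numpy parameterizes by scale = 1 / rate
print(D.round(2))
print((1.0 / rates).round(2))               # expected values E[d_ij] = 1 / rate
```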

Maximizing the Entropy The optimal Lagrange multipliers can be found using standard gradient descent methods Requires computing the gradient for the multipliers There are $m + n$ multipliers for an n-by-m matrix But we only need to consider $\lambda$s for distinct $r_i$ and $c_j$, which can be considerably fewer E.g. $O(\sqrt{2s})$ for $s$ non-zeros in a binary matrix Overall worst-case time per iteration is $O(s)$ for gradient descent For Newton's method, it is $O(\sqrt{s}^3)$ T III.2-16
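
A hedged sketch of the gradient descent (plain descent on all m + n multipliers, without the distinct-margins trick mentioned above): the gradient of the dual with respect to each multiplier is simply the expected margin minus the target margin. The data matrix below is a toy example.

```python
import numpy as np

def fit_margins(r, c, steps=5000, lr=0.05):
    """Fit row/column multipliers so the expected margins match r and c."""
    l_r, l_c = np.zeros(len(r)), np.zeros(len(c))
    for _ in range(steps):
        P = 1.0 / (1.0 + np.exp(-(l_r[:, None] + l_c[None, :])))  # E[d_ij]
        l_r -= lr * (P.sum(axis=1) - r)    # gradient of the dual w.r.t. lambda^r
        l_c -= lr * (P.sum(axis=0) - c)    # gradient of the dual w.r.t. lambda^c
    return l_r, l_c

D = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]])          # toy binary data
l_r, l_c = fit_margins(D.sum(axis=1).astype(float), D.sum(axis=0).astype(float))
P = 1.0 / (1.0 + np.exp(-(l_r[:, None] + l_c[None, :])))
print(P.sum(axis=1), P.sum(axis=0))        # should be close to the margins of D
```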

MaxEnt and Swap Randomization MaxEnt models constrain the expected margins; swap randomization constrains the actual margins Does it matter? If $M(r, c)$ is the set of all n-by-m binary matrices with the same row and column margins, the MaxEnt model will give the same probability to each matrix in $M(r, c)$ More generally, the probability is invariant under adding a constant to the diagonal and subtracting it from the antidiagonal of any 2-by-2 submatrix T III.2-17

The Interestingness of a Tile Given a tile $\tau$ and a MaxEnt model for the binary data (w.r.t. row and column margins), the self-information of $\tau$ is $-\sum_{(i,j)\in\tau} \log(p_{ij})$, where $p_{ij} = \exp(\lambda^r_i + \lambda^c_j)/\bigl(1 + \exp(\lambda^r_i + \lambda^c_j)\bigr)$ The description length of the tile is the number of bits it takes to explain the tile The compression ratio of $\tau$ is the fraction SelfInformation(τ)/DescriptionLength(τ) T III.2-18
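
A small sketch computing this quantity for a hypothetical tile, reusing the made-up multipliers from the sampling sketch above:

```python
import numpy as np

def self_information(tau, l_r, l_c):
    """-sum_{(i,j) in tau} log p_ij, with p_ij = exp(l_r[i]+l_c[j]) / (1 + exp(l_r[i]+l_c[j]))."""
    return -sum(np.log(1.0 / (1.0 + np.exp(-(l_r[i] + l_c[j])))) for i, j in tau)

l_r = np.array([-0.5, 0.0, 0.7])
l_c = np.array([-1.0, 0.2, 0.2, 1.0])
tau = [(0, 1), (0, 2), (1, 1), (1, 2)]       # a hypothetical 2-by-2 tile
print(self_information(tau, l_r, l_c))       # in nats (natural log); higher means more surprising
```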

Set of Tiles The description length for a set of tiles is the sum of the tiles' description lengths The self-information for a set of tiles is the self-information of their union Repeatedly covering a value doesn't increase the self-information Finding a set of tiles with maximum self-information but with a description length below a threshold is an NP-hard problem (budgeted maximum coverage) A greedy approximation achieves an $(e-1)/e$ approximation ratio T III.2-19
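
A hedged sketch of the greedy heuristic: repeatedly add the tile with the best ratio of added self-information (over not-yet-covered cells) to its description length while the budget allows. Note that the full (e-1)/e guarantee for budgeted maximum coverage needs partial enumeration on top of this simple ratio rule. The per-cell information values and description lengths below are stand-ins.

```python
def greedy_tiles(tiles, lengths, cell_info, budget):
    """tiles: list of sets of (i, j) cells; lengths: description lengths; budget: DL limit."""
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for t, cells in enumerate(tiles):
            if t in chosen or spent + lengths[t] > budget:
                continue
            gain = sum(cell_info[c] for c in cells - covered)   # only new cells add information
            if gain / lengths[t] > best_ratio:
                best, best_ratio = t, gain / lengths[t]
        if best is None:
            return chosen
        chosen.append(best)
        covered |= tiles[best]
        spent += lengths[best]

tiles = [{(0, 0), (0, 1)}, {(0, 1), (1, 1)}, {(2, 2)}]
cell_info = {c: 1.0 for t in tiles for c in t}        # stand-in per-cell self-information
print(greedy_tiles(tiles, lengths=[1.0, 1.5, 0.5], cell_info=cell_info, budget=2.0))
```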

Noisy Tiles If we allow noisy tiles, the self-information changes: the 0s also convey information $\mathrm{SelfInformation}(\tau) = -\sum_{(i,j)\in\tau\colon d_{ij}=1} \log\frac{\exp(\lambda^r_i + \lambda^c_j)}{1 + \exp(\lambda^r_i + \lambda^c_j)} - \sum_{(i,j)\in\tau\colon d_{ij}=0} \log\frac{1}{1 + \exp(\lambda^r_i + \lambda^c_j)}$ The location of the 0s in the tile can be encoded in the description length using at most $\log\binom{IJ}{n_0}$ bits for a tile of size I-by-J that has $n_0$ zeros Kontonasios & De Bie 2010 T III.2-20
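
The same computation for noisy tiles, as a sketch with toy data and multipliers:

```python
import numpy as np

def noisy_self_information(tau, D, l_r, l_c):
    """1s contribute -log p_ij and 0s contribute -log(1 - p_ij)."""
    info = 0.0
    for i, j in tau:
        p = 1.0 / (1.0 + np.exp(-(l_r[i] + l_c[j])))   # Bernoulli success probability
        info += -np.log(p) if D[i, j] == 1 else -np.log(1.0 - p)
    return info

D = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1]])
l_r = np.array([-0.5, 0.0, 0.7])
l_c = np.array([-1.0, 0.2, 0.2, 1.0])
print(noisy_self_information([(0, 1), (0, 2), (1, 1), (1, 2)], D, l_r, l_c))
```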

Real-Valued Data We already saw how to build a MaxEnt model with constraints on the means of rows and columns Here: constrain means and variances, or constrain the histograms of rows and columns Similar to the options from last week The second option is obviously stronger Kontonasios, Vreeken & De Bie 2011 T III.2-21

Preserving Means and Variances To preserve row and column means and variances, we need to constrain row and column sums and row and column sums-of-squares After solving the MaxEnt equation, we again get that the MaxEnt distribution for D is a product of probabilities for $d_{ij}$: $\Pr(d_{ij}) \sim N\!\left(\frac{-(\lambda^r_i + \lambda^c_j)}{2(\mu^r_i + \mu^c_j)},\; \bigl(-2(\mu^r_i + \mu^c_j)\bigr)^{-1/2}\right)$ The $\lambda$s are Lagrange multipliers associated with the constraints on sums, and the $\mu$s are Lagrange multipliers associated with the constraints on sums-of-squares T III.2-22

Preserving the Histograms We can express the distribution using a histogram of its values Bin number and widths are selected automatically based on MDL The constraints for histograms require that we keep the contents of the bins (in expectation) intact The resulting distribution is a histogram itself T III.2-23

Some Notes These methods again assume that summing over rows and columns makes sense Sampling is considerably faster than with swap randomization Order-of-magnitude difference in the worst case MaxEnt models also allow computing analytical p-values for individual patterns T III.2-24

Essay Topics Swap-based methods vs. maximum entropy methods What are they? How do they work? Similarities? Differences? Is one better than the other? Consider both the binary and the continuous case Method for finding a frequency threshold for significant itemsets vs. other methods Kirsch et al. 2012 paper, explained in the TIII.intro lecture How is it different from the swap-based or MaxEnt-based methods we've discussed? Only for binary data DL 29 January T III.2-25

Exam Information 19 February (Tuesday) Oral exam Room 021 at the MPII building (E1.4) Time frame: 10 am to 6 pm If you have constraints within this time frame, send me an email About 20 min per student I will ask questions on one or two topic areas You can veto one proposed topic area, but only one T III.2-26