Learning Structured Probability Matrices!

Size: px

Start display at page:

Download "Learning Structured Probability Matrices!"

Lindsey Stephens
5 years ago
Views:

1 Learning Structured Probability Matrices Qingqing Huang 2016 February Laboratory for Information & Decision Systems Based on joint work with Sham Kakade, Weihao Kong and Greg Valiant. 1

2 2 Learning Data generation X()

3 3 Learning Data generation X() Learning Infer about the underlying rule θ (Estimation, approximation, property testing, optimization of f(θ) )

4 4 Learning Data generation X() Learning Infer about the underlying rule θ (Estimation, approximation, property testing, optimization of f(θ) ) Challenge: Exploit our prior for structure of the underlying θ to design fast algorithm that uses as few as possible data X to achieve the target accuracy in learning θ

5 5 Learning Data generation X() Learning Infer about the underlying rule θ (Estimation, approximation, property testing, optimization of f(θ) ) Challenge: Exploit our prior for structure of the underlying θ to design fast algorithm that uses as few as possible data X to achieve the target accuracy in learning θ Computation Complexity Sample Complexity dim()

6 Unstructured Probability Discrete probability distribution (support over M outcomes) p X N i.i.d. samples N Poisson (frequency counts over M outcomes) B = Poisson(Np) 6

7 7 Unstructured Probability Discrete probability distribution (support over M outcomes) p X N i.i.d. samples N Poisson (frequency counts over M outcomes) B = Poisson(Np) M =5 4 N =

8 Unstructured Probability Discrete probability distribution (support over M outcomes) p X N i.i.d. samples N Poisson (frequency counts over M outcomes) B = Poisson(Np) M =5 4 N = Goal: find bp such that kbp pk 1 apple N (é M) sample complexity: upper / lower bound 8

9 9 Structured Probability X Probability Matrix B 2 R M M + N i.i.d. samples (distribution over M 2 outcomes) ( freq counts over outcomes) M 2 B = Poisson(NB)

10 10 Structured Probability X Probability Matrix B 2 R M M + N i.i.d. samples (distribution over M 2 outcomes) ( freq counts over outcomes) M 2 B is of low rank B = pp > + qq > B = Poisson(NB)

11 11 Structured Probability X Probability Matrix B 2 R M M + N i.i.d. samples (distribution over M 2 outcomes) ( freq counts over outcomes) M 2 B is of low rank B = pp > + qq > B = Poisson(NB) M = N = B.10 p.20 q B

12 Structured Probability X Probability Matrix B 2 R M M + N i.i.d. samples (distribution over M 2 outcomes) ( freq counts over outcomes) M 2 B is of low rank B = pp > + qq > B = Poisson(NB) M = N = B.10 p.20 q B? Goal: find rank-2 bb k b B Bk 1 apple such that N (é M) sample complexity: upper / lower bound 12

13 13 Connection to Learning Mixture Models Topic model Pr(word 1, word 2 topic = T 1 )=pp > Pr(word 1, word 2 topic = T 2 )=qq > B joint distribution over word pairs

14 14 Connection to Learning Mixture Models Topic model Pr(word 1, word 2 topic = T 1 )=pp > Pr(word 1, word 2 topic = T 2 )=qq > B joint distribution over word pairs HMM Pr(output 1, output 2 state = S i )=O i (OQ i ) > B distribution of past and future outputs

output 2 state = S i )=O i (OQ i ) > B distribution of past and future outputs Spectral

15 15 Connection to Learning Mixture Models Topic model Pr(word 1, word 2 topic = T 1 )=pp > Pr(word 1, word 2 topic = T 2 )=qq > B joint distribution over word pairs HMM Pr(output 1, output 2 state = S i )=O i (OQ i ) > B distribution of past and future outputs Spectral Algorithm: N data samples Estimate model parameters empirical dsitribution find low rank close to B b B B

16 16 Structured Distribution Learning Discrete probability distribution X i.i.d. samples counts Sample complexity Unstructured Low rank structure Estimation Linear? Property Testing Sub-Linear?

17 Sub-Optimal Attempt Probability Matrix B 2 R M M + B is of rank 2: B = > > + X N i.i.d. samples B = Poisson(NB) 17

18 Sub-Optimal Attempt Probability Matrix B 2 R M M + B is of rank 2: B = > > + X N i.i.d. samples B = Poisson(NB) MLE is non-convex optimization L let s try something intuitive J 18

19 Sub-Optimal Attempt Probability Matrix B 2 R M M + B is of rank 2: B = > > + X N i.i.d. samples B = Poisson(NB) 1 N B = 1 N Poisson(NB) B, as N 1 Set bb to be the rank 2 truncated SVD of 1 N B 19

20 Sub-Optimal Attempt Probability Matrix B 2 R M M + B is of rank 2: B = > > + X N i.i.d. samples B = Poisson(NB) 1 N B = 1 N Poisson(NB) B, as N 1 Set bb to be the rank 2 truncated SVD of 1 N B To achieve accuracy kb b Bk 1 apple need N = (M 2 log M) 20

21 Sub-Optimal Attempt Probability Matrix B 2 R M M + B is of rank 2: B = > > + X N i.i.d. samples B = Poisson(NB) 1 N B = 1 N Poisson(NB) B, as N 1 Set bb to be the rank 2 truncated SVD of 1 N B To achieve accuracy kb b Bk 1 apple need N = (M 2 log M) Not sample efficient Hopefully N = (M) 21

22 22 Sub-Optimal Attempt Probability Matrix B 2 R M M + B is of rank 2: B = > > + X N i.i.d. samples B = Poisson(NB) 1 N B = 1 N Poisson(NB) B, as N 1 Set bb to be the rank 2 truncated SVD of 1 N B To achieve accuracy kb b Bk 1 apple need N = (M 2 log M) Not sample efficient Hopefully N = (M) Small data in practice Word distribution in language has fat tail. More sample documents N, larger the vocabulary size M

23 Main Results Our upper bound algorithm: Rank-2 estimate with accuracy Using number of samples Runtime N = O(M/ 2 ) O(M 3 ) We prove (strong) lower bound: bb k b B Bk 1 apple 8 > 0 Need a sequence of N = (M) observations to test whether the sequence is i.i.d. of unif (M) or generated by a 2-state HMM 23

24 Main Results Our upper bound algorithm: Rank-2 estimate with accuracy Using number of samples Runtime N = O(M/ 2 ) O(M 3 ) We prove (strong) lower bound: bb k b B Bk 1 apple 8 > 0 Need a sequence of N = (M) observations to test whether the sequence is i.i.d. of unif (M) or generated by a 2-state HMM 24

25 Main Results Our upper bound algorithm: Rank-2 estimate with accuracy Using number of samples Runtime N = O(M/ 2 ) O(M 3 ) We prove (strong) lower bound: bb k b B Bk 1 apple 8 > 0 Need a sequence of N = (M) observations to test whether the sequence is i.i.d. of unif (M) or generated by a 2-state HMM Sample complexity Unstructured Low rank structure Estimation Linear Linear Property Testing Sub-Linear Linear 25

26 26 Algorithmic Idea We capitalize the idea of community detection in sparse random network, SBM is a special case of our problem formulation with homogeneous nodes

27 M nodes 2 communities Algorithmic Idea We capitalize the idea of community detection in sparse random network, SBM is a special case of our problem formulation with homogeneous nodes Expected connection Adjacency matrix B = pp > + qq > B = Bernoulli(NB) generate estimate B p q B 27

Bernoulli(NB).09.09.09.02.02.02.09.09.09.02.02.02.09.09.09.02.02.02.30.30.30.03.

28 M nodes 2 communities Algorithmic Idea We capitalize the idea of community detection in sparse random network, SBM is a special case of our problem formulation with homogeneous nodes Expected connection Adjacency matrix B = pp > + qq > B = Bernoulli(NB) generate estimate B p q B 28

29 M nodes 2 communities Algorithmic Idea We capitalize the idea of community detection in sparse random network, SBM is a special case of our problem formulation with homogeneous nodes Expected connection Adjacency matrix B = pp > + qq > B = Bernoulli(NB) Regularize Truncated SVD: [Le, Levina, Vershynin] remove heavy row/column from B, run rank-2 SVD on the remaining graph B p q generate estimate B 29

30 30 Algorithmic Idea We capitalize the idea of community detection in sparse random network, SBM is a special case of our problem formulation with homogeneous nodes M nodes 2 communities Expected connection Adjacency matrix B = pp > + qq > B = Bernoulli(NB) Regularize Truncated SVD: [Le, Levina, Vershynin] remove heavy row/column from B, run rank-2 SVD on the remaining graph M M matrix Probability matrix B = > + > Sample counts B = Poisson(N B) Key Challenge: In our general setup, we have heterogeneous nodes/ marginal probabilities

31 Algorithmic Idea 1, Binning We group the M nodes according to the empirical marginal probability, divide the matrix into blocks, then apply regularized t-svd to each block Phase 1 1. Estimate non-uniform marginal b 2. Bin M nodes according to b i 3. Regularized t-svd in each bin bin block of B 31

Estimate non-uniform marginal b 2. Bin M nodes according to b i 3.

32 Algorithmic Idea 1, Binning We group the M nodes according to the empirical marginal probability, divide the matrix into blocks, then apply regularized t-svd to each block Phase 1 1. Estimate non-uniform marginal b 2. Bin M nodes according to b i 3. Regularized t-svd in each bin bin block of B Key Challenges: ü Binning is not exact, we need to deal with spillover ü We need to piece together estimates over bins 32

Make use of that to do local refinement for each row / column of B Phase

33 33 Algorithmic Idea 2, Refinement The coarse estimation from Phase 1 gives some global information. Make use of that to do local refinement for each row / column of B Phase 2 1. Refine the estimate for each node use linear regression 2. Achieve sample complexity N = O(M/ 2 ) minmax optimal

34 34 Take-Away Message Structured Data generation Learning X() We identify a problem that lies at the core of many learning problems Spectral algorithm is not solving for the non-convex MLE, we need carefully designed algorithm to improve its statistical efficiency Coming soon: estimation / approximation / property testing of structured probability distribution

Efficient Spectral Methods for Learning Mixture Models!

Efficient Spectral Methods for Learning Mixture Models Qingqing Huang 2016 February Laboratory for Information & Decision Systems Based on joint works with Munther Dahleh, Rong Ge, Sham Kakade, Greg Valiant.