Learning about State. Geoff Gordon Machine Learning Department Carnegie Mellon University
1 Learning about State Geoff Gordon Machine Learning Department Carnegie Mellon University joint work with Byron Boots, Sajid Siddiqi, Le Song, Alex Smola
2 What's out there? o_{t-2} o_{t-1} o_t o_{t+1} o_{t+2} [video: steam rising from a grate]
3 What's out there? A dynamical system. [Diagram: past observations o_{t-2}, o_{t-1}, o_t and future observations o_{t+1}, o_{t+2}, connected through a latent state.] Dynamical system = recursive rule for updating state based on observations.
4 Learning a dynamical system. [Same past/future/state diagram.] Given past observations from a partially observable system, predict future observations.
5 Examples: Baum-Welch EM algorithm for HMMs; Tomasi-Kanade structure from motion; Black-Scholes model of stock prices; SLAM (from lidars, cameras, beacons, ...); system identification for Kalman filters.
6 A general principle: predict data about the future (many samples) from data about the past (many samples) by compressing through a bottleneck (the state) and then expanding. If the bottleneck is a rank constraint, we get a spectral method.
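To make the bottleneck idea concrete, here is a minimal NumPy sketch (not from the talk) of rank-constrained regression from past-window features to future-window features: fit the full linear predictor, then compress it through a rank bottleneck by truncating its SVD. All names are made up for illustration, and the statistically preferred reduced-rank regression would whiten the past features first.

```python
import numpy as np

def rank_bottleneck_regression(past, future, rank):
    """Predict future-window features from past-window features through a rank bottleneck.

    past, future: arrays of shape (T, n_past) and (T, n_future), one row per time step.
    Returns W with rank <= `rank` such that future[t] is approximated by W @ past[t].
    """
    # "Expand": ordinary least-squares map from past features to future features.
    W_full, *_ = np.linalg.lstsq(past, future, rcond=None)   # shape (n_past, n_future)
    W_full = W_full.T                                         # shape (n_future, n_past)

    # "Compress": keep only the top `rank` singular directions -- the spectral bottleneck.
    U, s, Vt = np.linalg.svd(W_full, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
```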
7 Why spectral methods? There are many ways to learn models of dynamical systems: max likelihood via EM or gradient descent, Bayesian inference via Gibbs, MH, ... In contrast to these, spectral methods have no local optima: a huge gain in computational efficiency, for a slight loss in statistical efficiency.
8 Example: SSID for a Kalman filter. x_{t+1} = A x_t + noise, o_t = C x_t + noise, with x in R^n and o in R^m (A is n x n, C is m x n). Past data = last k observations; future data = next k observations (both windows must be big enough). Prediction = linear regression: look at the empirical covariance of past & future. Spectral: bottleneck = SVD of the covariance. [van Overschee & de Moor, 1996]
9 Kalman SSID. x_{t+1} = A x_t + noise, o_t = C x_t + noise. Assume for simplicity m >= n, with both A and C full rank. For k >= 1, writing P = E[x_t x_t^T]:
E[o_{t+k} o_t^T] = E[E[o_{t+k} o_t^T | x_t]]
               = E[E[o_{t+k} | x_t] E[o_t | x_t]^T]
               = E[C A^k x_t (C x_t)^T]
               = C A^k E[x_t x_t^T] C^T = C A^k P C^T
10 Kalman SSID. Σ_k = E[o_{t+k} o_t^T] = C A^k P C^T. Let U = the n leading left singular vectors of Σ_1, and write S = U^T C A. Then
Â := U^T Σ_2 (U^T Σ_1)^+ = U^T C A^2 P C^T (U^T C A P C^T)^+ = (U^T C A) A P C^T (P C^T)^+ (U^T C A)^{-1} = S A S^{-1}
Ĉ := U Â^{-1} = U S A^{-1} S^{-1} = U (U^T C A) A^{-1} S^{-1} = C S^{-1}
11 Kalman SSID. Algorithm: estimate Σ_1 and Σ_2 from data, get Û by SVD, plug in for Â and Ĉ. Consistent: continuity of the formulas for Â and Ĉ, plus the law of large numbers for Σ_1 and Σ_2 (wrinkle: the SVD for Û isn't continuous, but range(Û) is). Can also recover the steady-state x.
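As a rough illustration of this algorithm, here is a NumPy sketch (not the authors' code): estimate Σ_1 and Σ_2 from a single zero-mean observation sequence, then plug them into the Â and Ĉ formulas above as reconstructed on the previous slide. The array layout, latent-dimension argument, and function name are assumptions for the example.

```python
import numpy as np

def kalman_ssid(obs, n):
    """Spectral subspace ID sketch: recover (A-hat, C-hat), equal to (S A S^-1, C S^-1)
    in the limit of infinite data, from empirical covariances of the observations.

    obs: array of shape (T, m), one (zero-mean) observation per row.
    n:   latent dimension.
    """
    T = obs.shape[0]
    Sigma1 = obs[1:T-1].T @ obs[:T-2] / (T - 2)   # estimate of E[o_{t+1} o_t^T]
    Sigma2 = obs[2:T].T   @ obs[:T-2] / (T - 2)   # estimate of E[o_{t+2} o_t^T]

    U = np.linalg.svd(Sigma1)[0][:, :n]           # n leading left singular vectors

    A_hat = U.T @ Sigma2 @ np.linalg.pinv(U.T @ Sigma1)   # = S A S^{-1}
    C_hat = U @ np.linalg.inv(A_hat)                      # = C S^{-1}; needs A-hat invertible
    return A_hat, C_hat
```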
12 Variations. Use arbitrary features of a length-k window of past and future observations, and work from the covariance of past and future features (good features make a big difference in practice). Impose constraints on the learned model (e.g., stability).
13 Kalman SSID: example. Works well for video textures: the steam-grate example above, and a fountain. Observation = raw pixels (a vector of reals over time).
14 Structure from motion. Track N features over T steps and collect them in a measurement matrix whose row i holds the image coordinates of feature i over time:
[ x_11 y_11  x_12 y_12  ...  x_1T y_1T ]
[ x_21 y_21  x_22 y_22  ...  x_2T y_2T ]
[            ...                       ]
[ x_N1 y_N1  x_N2 y_N2  ...  x_NT y_NT ]
[Tomasi & Kanade, 1992]
15 Structure from motion. In the measurement matrix above, x_it is the projection of feature i onto the camera's horizontal axis at time t (and y_it, the vertical). Notation: [u_i, v_i, w_i] = feature i coordinates; [h_1t, h_2t, h_3t] = camera horizontal axis; [v_1t, v_2t, v_3t] = camera vertical axis.
16 Structure from motion. The N x 2T measurement matrix factors as an N x 3 matrix of feature coordinates (rows [u_i, v_i, w_i]) times a 3 x 2T matrix of camera axes (columns [h_1t, h_2t, h_3t] and [v_1t, v_2t, v_3t], interleaved over time), so it has rank at most 3.
17 Structure from motion. The factorization is only determined up to an invertible transform (any invertible matrix can be inserted between the two factors).
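For reference, a small NumPy sketch of the Tomasi-Kanade-style factorization described above, under the affine-camera assumption. The measurement-matrix layout (N feature rows, 2T columns of interleaved x/y coordinates) follows the slides; the function name and centering convention are assumptions. Any invertible 3 x 3 matrix can be inserted between the returned factors, which is exactly the ambiguity noted on this slide.

```python
import numpy as np

def factorize_measurements(W):
    """Rank-3 factorization of an N x 2T tracked-feature matrix W (columns alternate
    x and y image coordinates of the N features over T frames).

    Returns (structure, cameras) with the centered W ~= structure @ cameras, where
    structure is N x 3 (feature coordinates) and cameras is 3 x 2T (camera axes).
    """
    # Affine camera: subtracting each column's mean over features removes the translation.
    W_centered = W - W.mean(axis=0, keepdims=True)

    U, s, Vt = np.linalg.svd(W_centered, full_matrices=False)
    root = np.sqrt(s[:3])
    structure = U[:, :3] * root           # N x 3
    cameras = root[:, None] * Vt[:3, :]   # 3 x 2T
    return structure, cameras
```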
18 SfM as SSID. The covariance is the measurement matrix above. Past data: an indicator of the time step & h/v axis, which means we get to memorize each time step (no attempt to learn dynamics). Future data: the observed screen coordinates (a column of the matrix).
19 Kalman SSID: failure. [Video comparison: HMM (Baum-Welch) vs. Kalman filter (SSID) vs. a preview of the method to come; all models have 10 latent dimensions.]
20 Can we generalize? x_{t+1} = A x_t + noise, o_t = C x_t + noise (A is n x n, C is m x n). Get rid of the Gaussian noise assumption. HMM: same form as the Kalman filter, but A >= 0 with 1^T A = 1^T, C >= 0 with 1^T C = 1^T, noise ~ multinomial, and x, o are indicator vectors: e.g., state 4 = [0 0 0 1 0 ...]^T.
21 Derivations for Kalman vs. HMM. The same derivation goes through in both cases (assume for simplicity m >= n, both A and C full rank):
E[o_{t+k} o_t^T] = E[E[o_{t+k} o_t^T | x_t]] = E[E[o_{t+k} | x_t] E[o_t | x_t]^T] = E[C A^k x_t (C x_t)^T] = C A^k E[x_t x_t^T] C^T = C A^k P C^T
22 HMM SSID: first try. As before, recover Â and Ĉ from E[o_{t+1} o_t^T] and E[o_{t+2} o_t^T]:
U^T Σ_2 (U^T Σ_1)^+ = U^T C A^2 P C^T (U^T C A P C^T)^+ = (U^T C A) A P C^T (P C^T)^+ (U^T C A)^{-1} = S A S^{-1}
But the recovered model doesn't satisfy A >= 0, 1^T A = 1^T, C >= 0, 1^T C = 1^T: is this a problem?
23 Merging A & C. HMM tracking (filtering): write b_t = P[x_t | o_{1:t}].
Predict:  P[x_t | o_{1:t-1}] = b_{t-0.5} = A b_{t-1}
Observe:  P[o_t | o_{1:t-1}] = C b_{t-0.5}
Update:   if o_t = o, then P(x_t = x | o_{1:t}) = P(o | x_t = x) P(x_t = x | o_{1:t-1}) / Z, where Z = P(o_t = o | b_{t-1}); i.e., b_t = diag(C_{o,:}) b_{t-0.5} / Z.
Write A_o = diag(C_{o,:}) A. Then b_t = A_o b_{t-1} / Z and P(o_t = o | b_{t-1}) = 1^T A_o b_{t-1}, so it's enough to estimate the A_o.
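The recursions on this slide are easy to state in code. Below is a small NumPy sketch (illustration only) of HMM filtering written purely in terms of the observable operators A_o = diag(C_{o,:}) A; the function names and the list-based interface are assumptions.

```python
import numpy as np

def make_operators(A, C):
    """A_o = diag(C[o, :]) @ A for a column-stochastic transition matrix A and an
    observation matrix C with C[o, x] = P(o | x)."""
    return [np.diag(C[o, :]) @ A for o in range(C.shape[0])]

def filter_beliefs(A_ops, b0, observations):
    """Track b_t = P[x_t | o_{1:t}] using only the operators A_o.

    Yields (b_t, Z_t), where Z_t = P(o_t | o_{1:t-1}) = 1^T A_o b_{t-1}.
    """
    b = np.asarray(b0, dtype=float)
    for o in observations:
        unnormalized = A_ops[o] @ b     # A_o b_{t-1}
        Z = unnormalized.sum()          # 1^T A_o b_{t-1}
        b = unnormalized / Z            # b_t = A_o b_{t-1} / Z
        yield b, Z
```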
24 HMM SSID: try #2. Define
Σ_2^o := E[o_{t+2} δ(o_{t+1} = o) o_t^T]
      = E[E[o_{t+2} δ(o_{t+1} = o) o_t^T | x_t]]
      = E[E[o_{t+2} δ(o_{t+1} = o) | x_t] E[o_t | x_t]^T]
      = E[E[o_{t+2} | x_t, o_{t+1} = o] P[o_{t+1} = o | x_t] (C x_t)^T]
      = E[E[o_{t+2} | x_t, o_{t+1} = o] (1^T A_o x_t) (C x_t)^T]
      = E[C A (A_o x_t / (1^T A_o x_t)) (1^T A_o x_t) (C x_t)^T]
      = E[C A A_o x_t (C x_t)^T]
      = C A A_o E[x_t x_t^T] C^T = C A A_o P C^T
25 HMM SSID: try #2.
Â_o := U^T Σ_2^o (U^T Σ_1)^+ = U^T C A A_o P C^T (U^T C A P C^T)^+ = (U^T C A) A_o P C^T (P C^T)^+ (U^T C A)^{-1} = S A_o S^{-1}
Model: x <- A_o x / P(o), o ~ C x, with C_{o,:} = e^T A_o.
Algorithm: estimate Σ_1 and Σ_2^o from data; get Û by SVD of Σ_1; plug in to get Â_o (for each o). Also need ê = S^{-T} 1 = the leading left eigenvector of Â_1 + Â_2 + ... (summing over observations o).
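Putting the last two slides together, here is a NumPy sketch (an illustration, not the authors' implementation) of the spectral estimator for discrete observations: build the empirical Σ_1 and Σ_2^o, take the SVD of Σ_1, and form Â_o = Û^T Σ_2^o (Û^T Σ_1)^+. The symbol encoding and function name are assumptions; recovering ê is only noted in a comment.

```python
import numpy as np

def spectral_hmm_operators(obs, n_symbols, n):
    """Estimate observable operators A_o-hat = U^T Sigma_2^o (U^T Sigma_1)^+
    (each equal to S A_o S^{-1} in the limit) from a sequence of discrete symbols.

    obs: integer array of observed symbols in {0, ..., n_symbols - 1}.
    n:   latent dimension.
    """
    obs = np.asarray(obs)
    T = len(obs)
    O = np.eye(n_symbols)[obs]                       # indicator vectors, shape (T, n_symbols)

    Sigma1 = O[1:T-1].T @ O[:T-2] / (T - 2)          # estimate of E[o_{t+1} o_t^T]
    Sigma2 = np.zeros((n_symbols, n_symbols, n_symbols))   # indexed by the middle symbol o
    for t in range(T - 2):
        Sigma2[obs[t+1]] += np.outer(O[t+2], O[t])   # E[o_{t+2} 1{o_{t+1}=o} o_t^T]
    Sigma2 /= (T - 2)

    U = np.linalg.svd(Sigma1)[0][:, :n]
    pinv_US1 = np.linalg.pinv(U.T @ Sigma1)
    A_ops = [U.T @ Sigma2[o] @ pinv_US1 for o in range(n_symbols)]
    # The normalization vector e-hat would be the leading left eigenvector of sum(A_ops).
    return U, A_ops
```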
26 Example: clock. Discrete observations: sampled frames from the training video; when tracking, use nearest neighbor or Parzen windows (a mixture-of-Gaussians HMM). 10 latent dimensions.
27 Can we generalize? HMMs had x in Δ (the probability simplex); intuition: the number of discrete states = the number of dimensions. We now have x in SΔ, which is essentially equally restrictive. Can we allow x in X for a general set X, so that # states > # dims?
28 # states > # dims: the picture. [Figure: random projections of an N-dimensional simplex for N = 3, 15, 100.]
29 SSID for OOMs (= PSRs without actions, multiplicity automata, ...). An OOM uses the same update equations as the HMM: x <- A_o x / P(o), o ~ C x, C_{o,:} = e^T A_o; it is defined by transition matrices A_o and a normalization vector e. Like an HMM, but lift the restriction X = SΔ: instead of requiring A_o x >= 0, require only A_o x = λ x' for some valid state x' and λ >= 0. Includes HMMs as a special case.
30 OOM SSID. No change!
31 OOM example. No change! Our HMM SSID was actually learning OOMs all along.
32 Can we generalize? We've allowed a finer discretization of the observation space. Can we allow continuous observations? Yes: featurize! Let ϕ(o) be a feature function.
33 Featurize.
Σ_2^φ := E[o_{t+2} φ(o_{t+1}) o_t^T] = ∑_o φ(o) E[o_{t+2} δ(o_{t+1} = o) o_t^T] = ∑_o φ(o) C A A_o P C^T
Â_φ := U^T Σ_2^φ (U^T Σ_1)^+ = ∑_o φ(o) S A_o S^{-1}
Store Â_ϕ for many different ϕ; recover Â_o as needed.
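In the continuous case, Σ_2^φ can be estimated directly by averaging over samples, without ever summing over discrete symbols. A minimal NumPy sketch (illustrative; the array shapes and names are assumptions) for a single feature function φ applied to the middle observation:

```python
import numpy as np

def featurized_operator(o_t, phi_mid, o_t2, U, Sigma1):
    """A_phi-hat = U^T Sigma_2^phi (U^T Sigma_1)^+ for one feature function phi.

    o_t, o_t2: arrays of shape (T, m) holding o_t and o_{t+2} for T aligned triples.
    phi_mid:   array of shape (T,) holding phi(o_{t+1}) for the same triples.
    """
    Sigma2_phi = (o_t2 * phi_mid[:, None]).T @ o_t / len(phi_mid)   # E[o_{t+2} phi(o_{t+1}) o_t^T]
    return U.T @ Sigma2_phi @ np.linalg.pinv(U.T @ Sigma1)
```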
34 Example: range-only SLAM. A robot measures distances to L landmarks as it moves, and wants to reconstruct its path and the landmark locations. T = 1000, L = 20, window = 1 observation, latent dimension = 15. Features = exp(-d^2 / (2σ^2)) of the measured distances d.
35 Can we generalize? If some features are good, more must be better: kernels! Everything above is linear algebra, so it works just fine in an arbitrary RKHS and can be rewritten in terms of the Gram matrix (no infinite-dimensional computations required). Caveat: regularization is now more important.
36 Prediction performance. [Figures: slot car inertial measurement data, showing the slot car platform with its IMU and the racetrack, and average prediction error vs. prediction horizon for HMM, RR-HMM, LDS, Mean, Last, and the Hilbert-space-embedded HMM; a second panel shows path prediction error. Data thanks to Dieter Fox's lab.]
37 Learning in the loop: option pricing. [Figure: price path p_{t-100}, ..., p_t.] Price a financial derivative, a "psychic call": the holder gets to say "I bought a call 100 days ago." The underlying stock follows Black-Scholes dynamics with unknown parameters. One solution: identify the B-S parameters, then plan; but planning is itself hard.
38 Option pricing. A better solution [Van Roy et al.]: use RL. 16 hand-picked features (e.g., polynomials of the price history); initialize the policy arbitrarily; use least-squares temporal differences (LSTD) to estimate the value function; set policy := greedy; repeat.
39 Option pricing. Still better: use SSID inside policy iteration. Take the 16 original features from Van Roy et al. plus 204 additional low-originality features (e.g., linear functions of the price history of the underlying). SSID picks the best 16-dimensional dynamics to explain the feature evolution; then solve for the value function in closed form.
40 Policy iteration with spectral learning. [Figure: expected payoff per $ invested vs. iterations of policy iteration, comparing Threshold, LSTD(16), LSTD, LARS-TD, and PSTD.] PSTD is 0.82¢/$ better than the best competitor and 1.33¢/$ better than the best previously published result. Data: 1,000,000 time steps.
41 Making it fast. Bottleneck: SVD of the Gram or Hankel matrix. G: (# time steps)^2; H: (# observations x window length) x (# time steps). E.g., 1 hr of video, 24 fps, ..., 2 s window: G: (...) x (...), H: (...) x (...).
>> k = 50; n = 2000;  % n^2 = 4,000,000
>> tic; x = randn(n,n); [u,s,v] = svds(x,k); toc
Elapsed time is ... seconds.
42 Making it fast. Two techniques: online learning and random embeddings. Neither one is new, but the combination with PSR SSID is, and it makes a huge difference in practice.
43 Online learning. With each new observation, do a rank-1 update of the SVD (Brand) and of the inverse (Sherman-Morrison). With n features, latent dimension d, and T steps: space = O(nd), which may fit in cache; time = O(nd^2 T), i.e., bounded time per example. There is a small loss in statistical efficiency (the estimated subspace rotates), but we can deal with it. Problem: there is no rank-1 update of the kernel SVD!
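The Sherman-Morrison step mentioned here is one line of algebra; the sketch below (illustrative, standard formula) shows the O(n^2) rank-1 update of a cached inverse that keeps the per-example cost bounded.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(n^2) instead of O(n^3).

    Follows (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u).
    """
    Au = A_inv @ u                 # A^{-1} u
    vA = v @ A_inv                 # v^T A^{-1}
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)
```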
44 Random embedding. Often k(x, y) = k(x - y); such a kernel is PD iff the Fourier transform of k(z) satisfies p(ω) >= 0 (e.g., Gaussian, Laplacian, Cauchy). Then
k(z) = ∫_{R^d} p(ω) e^{jω·z} dω = ∫_{R^d} p(ω) cos(ω·z) dω = (1/Z) E_{ω ~ Zp}[cos(ω·z)]
which looks like an expectation. [Rahimi & Recht, 2007]
45 Random embedding.
cos(ω·(x - y)) = cos(ω·x) cos(ω·y) + sin(ω·x) sin(ω·y)
Features: cos(ω_i·x)/√k and sin(ω_i·x)/√k for k random ω_i ~ p. Then
φ(x)·φ(y) = (1/k) ∑_{i=1}^k cos(ω_i·(x - y)) -> E[cos(ω·(x - y))]
and the convergence is uniform in z as k -> ∞.
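Here is a minimal NumPy sketch of the random-embedding (random Fourier features) construction for the Gaussian kernel exp(-||x - y||^2 / (2 σ^2)); the bandwidth parameterization and function name are assumptions for the example. After the map, φ(x)·φ(y) approximates k(x, y), so the kernel SVD can be replaced by an ordinary SVD on the features.

```python
import numpy as np

def random_fourier_features(X, k, sigma, seed=0):
    """Map X (shape (N, d)) to features of shape (N, 2k) whose inner products
    approximate the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For the Gaussian kernel, p(omega) is itself Gaussian with std 1/sigma per dimension.
    Omega = rng.normal(scale=1.0 / sigma, size=(d, k))
    proj = X @ Omega                                   # omega_i . x for each i
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(k)
```

With the same seed (hence the same Ω) and k in the thousands, random_fourier_features(X, k, sigma) @ random_fourier_features(Y, k, sigma).T approaches the exact Gram matrix between X and Y.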
46 Random embedding example. [Figure: results for varying basis size.]
47 Results: closing the loop. Online + random features: 100k features, 11k frames, limited by available data. Offline: 2k frames, compressed & subsampled, compute-limited. [Figure panels: the table setup; learned embeddings after 100 and 600 samples; the first few steps; the final embedding (colors = 3rd dimension).]
48 Planning. [Diagram: image observations in, wheel velocities out.] Suppose we can predict future state; now choose actions to maximize reward.
49 Planning. Value iteration: exactly the same math as POMDP value iteration. Point-based methods are fast and accurate, but need a number of points exponential in the latent dimension: a possible big win for learned (low-dimensional) models.
50 Value iteration: data. [Figure: bird's-eye view with the goal image; 16x16 RGB observations rendered from a 3d view at t = 1 and t = 10.] Actions: 6 different noisy translations and rotations. Data: 10,000 random start positions; traces of 7 random actions, observations, and rewards (+1 for the goal, ϵ otherwise).
51 VI: learned subspace. [Figure: 2 dimensions of the learned 5D subspace; points take the average color of the next image.]
52 VI: plans. [Figure: two near-optimal plans reaching the goal.] Near-optimal plans with only a 5d latent space (compare to a 5-state POMDP).
53 Summary. Learn dynamical-system models with no local optima and fast online computation. A nonparametric (kernel-based) version handles near-arbitrary observation distributions. One general principle yields algorithms for HMMs, OOMs, SfM, range-only SLAM (known correspondences), Kalman-filter system ID, RL, ... Good results from a general-purpose algorithm on problems typically tackled with lots of problem-specific engineering.
54 Papers.
B. Boots and G. Gordon. An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems. AAAI, 2011.
B. Boots and G. Gordon. Predictive state temporal difference learning. NIPS, 2010.
B. Boots, S. M. Siddiqi, and G. Gordon. Closing the learning-planning loop with predictive state representations. RSS, 2010.
L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010. (Best paper)