Unsupervised Learning: Projections


1 Unsupervised Learning: Projections CMPSCI 689 Fall 2015 Sridhar Mahadevan Lecture 2

2 Data, Data, Data [Figure: example datasets. A LIBS spectrum; the Image Classification Challenge (1,000 object classes, 1,431,167 images; example output labels: scale, T-shirt, steel drum, drumstick, mud turtle, giant panda; Russakovsky et al., arXiv, 2014; Fei-Fei Li & Andrej Karpathy); Indus script; an Atari game]

3 Data, Data, Data [Figure labels: Spanish speakers; French speakers; Physics; graph of articles from Wikipedia]

4 Fundamental Challenges Diversity: data comes in different forms High-dimensionality: the number of dimensions may be very large Noise: real-world data is inherently stochastic Hidden regularities: how to uncover the latent underlying structure in data?

5 Mathematical Modeling A variety of mathematical formalisms can help us extract deep structure from data Linear algebra: map data onto a vector space Statistics: assume data is generated from some probability distribution Optimization: fit some smooth parametric model General principle: project data onto some regularity, and see how much structure is captured

6 Machine Learning Statistics Linear algebra Optimization

7 Goals of unsupervised learning Compression: minimize reconstruction error Reveal hidden regularities in data Find a new representation of the data Facilitate subsequent processing of the data Clustering, classification, ranking, regression

8 Linear Algebraic Approaches Singular value decomposition Vector quantization, clustering Eigenvalue decomposition Non-negative matrix factorization Matrix completion

9 Statistical Approaches Maximum likelihood and Bayesian methods Mixture of models, EM algorithm Hidden Markov models Density estimation, Manifold learning

10 Optimization Methods Deep learning using auto encoders Deep belief networks, restricted Boltzmann machines Recursive neural nets, LSTM Decision trees, kd-trees

11 Compression Consider, for example, representing some motion data of a robot in three dimensions. We can represent the state of the robot as a four-dimensional vector s(i) = (x(i), y(i), z(i), t(i)). We are given a large dataset D of these vectors. We want to compress the data and extract structure.

12 Robot motion data In this case, the motion data has 12 dimensions Why can this data be compressed?

13 Finding a Good Basis Abstract definition: Given a dataset D of vectors in $\mathbb{R}^N$, find the smallest number k of basis vectors that can efficiently represent D (where $k \ll N$). Suppose we choose k = 1. What is the best one-dimensional representation?

14 Projection onto a vector [Figure: a vector v and its projection w onto the all-ones vector $\mathbf{1} = (1, 1, \ldots, 1)$] Example: Project the vector (3, 2, -1, 1, -1, 2) onto $\mathbf{1}$. Answer: ?

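15 General Principle of Projection [Figure: a vector v, its projection w along x, and the error vector v - w] With the projection written as $w = \alpha x$: $(v - w)^T x = 0$, so $(v - \alpha x)^T x = 0$, so $v^T x = \alpha\, x^T x$, so $\alpha = \frac{v^T x}{x^T x}$. The error vector v - w must be perpendicular to x.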
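
The projection rule on this slide can be tried on the example from the previous slide. The following is a minimal sketch in plain Python, assuming the projection is written as w = alpha * x with alpha = (v . x) / (x . x); the helper names `dot` and `project` are illustrative and not from the lecture.

```python
def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def project(v, x):
    """Return (alpha, w), where w = alpha * x is the projection of v onto x."""
    alpha = dot(v, x) / dot(x, x)
    return alpha, [alpha * xi for xi in x]


v = [3, 2, -1, 1, -1, 2]          # vector from the slide's example
ones = [1] * len(v)               # the all-ones vector

alpha, w = project(v, ones)
error = [vi - wi for vi, wi in zip(v, w)]

print("alpha =", alpha)
print("w     =", w)
print("error . ones =", dot(error, ones))  # zero when the error is perpendicular
```

The last line checks the stated principle directly: the error vector v - w has zero dot product with the direction being projected onto.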

16 Random Variables A random variable X is a function from a sample space S into the real numbers $\mathbb{R}$. We denote the value of the variable by X(s) for element $s \in S$ of the sample space. A random variable X induces a probability function $P_X$ on $\mathcal{X}$, where $\mathcal{X}$ is the range of X: $P_X(X = x_i) = P(\{s_j \in S : X(s_j) = x_i\})$. Example: 10 random variables in the diabetes data set: age, sex, BMI, BP, serum measurements.

17 Projections and Statistics How do we relate projections, a geometrical idea, to statistics? Think of a random variable x (e.g., temperature in Amherst) as a vector of values (85, 80, 75, 81, ...). One simple way to understand random variables is to find their mean value E(x). Geometrically, what is E(x)?

18 Projections and Statistics [Figure: the vector x projected onto $\mathbf{1} = (1, 1, \ldots, 1)$, giving $\mu\mathbf{1}$] $\mu = \frac{\sum_{i=1}^{n} x_i}{n}$

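19 Projections and Statistics [Figure: the vector x, the all-ones vector $\mathbf{1} = (1, 1, \ldots, 1)$, and the projection $\mu\mathbf{1}$] What about the error vector $x - \mu\mathbf{1}$? Another fundamental way to model random variables is through their variance: $\sigma^2 = \frac{1}{n}\|x - \mu\mathbf{1}\|^2 = \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{n}$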
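
The geometric reading of mean and variance can be checked numerically. This is a minimal sketch, assuming the data vector consists of only the four temperature values listed earlier (the slide's list continues beyond them); the function name `mean_and_variance_by_projection` is illustrative.

```python
def mean_and_variance_by_projection(x):
    """Mean as the projection coefficient onto the all-ones vector,
    variance as the squared length of the error vector divided by n."""
    n = len(x)
    ones = [1.0] * n
    # Projection coefficient: (x . 1) / (1 . 1), which equals sum(x) / n.
    mu = sum(xi * oi for xi, oi in zip(x, ones)) / sum(oi * oi for oi in ones)
    # Error vector x - mu * 1, perpendicular to the all-ones vector.
    error = [xi - mu * oi for xi, oi in zip(x, ones)]
    sigma2 = sum(e * e for e in error) / n
    return mu, sigma2


temps = [85, 80, 75, 81]  # assumption: only the four values shown on the slide
mu, sigma2 = mean_and_variance_by_projection(temps)
print("mean     =", mu)
print("variance =", sigma2)
```

Note that the variance here divides by n, matching the formula on the slide, rather than by n - 1.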

20 Expectation, Mean, and Variance The expected value (or mean) $\mu_X$ of a random variable X is $\mu_X = E(X) = \int x f_X(x)\,dx$ if X is continuous, and $\mu_X = E(X) = \sum_{x \in \mathcal{X}} x\,P(X = x)$ if X is discrete. The variance (or average squared deviation from the mean) of a random variable X is $\mathrm{Var}(X) = E(X - \mu_X)^2$. The positive square root of the variance is defined as the standard deviation $\sigma_X$.

21 Expectation: Properties Linearity: $E(a_1 X_1 + \ldots + a_n X_n) = a_1 E(X_1) + \ldots + a_n E(X_n)$. Nested expectation: $E(E(X)) = E(X)$. Expected deviation around the mean is 0: $E(X - \mu_X) = 0$. Exercise: Show $\mathrm{Var}(X) = E(X^2) - \mu_X^2$. Given a set of independent random variables $X_1, \ldots, X_n$: $\mathrm{Var}(a_1 X_1 + \ldots + a_n X_n) = a_1^2 \mathrm{Var}(X_1) + \ldots + a_n^2 \mathrm{Var}(X_n)$.
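
The exercise identity Var(X) = E(X^2) - mu_X^2 can be sanity-checked on a small discrete distribution before proving it. The sketch below assumes a fair six-sided die as the example distribution (not from the lecture) and uses exact fractions so the comparison is not blurred by rounding; `expectation` is an illustrative helper.

```python
from fractions import Fraction


def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete X given as a {value: probability} mapping."""
    return sum(p * g(x) for x, p in pmf.items())


# Assumed example distribution: a fair six-sided die.
die = {k: Fraction(1, 6) for k in range(1, 7)}

mu = expectation(die)
mean_deviation = expectation(die, lambda x: x - mu)        # should be 0
var_definition = expectation(die, lambda x: (x - mu) ** 2)
var_shortcut = expectation(die, lambda x: x * x) - mu ** 2

print("mu               =", mu)
print("E(X - mu)        =", mean_deviation)
print("E(X - mu)^2      =", var_definition)
print("E(X^2) - mu^2    =", var_shortcut)
```

A numerical check of this kind is not a proof; the proof follows by expanding the square and applying linearity of expectation.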


22 Projections and Optimization Let us now bring in the perspective of optimization. Here, we are given some objective function f(λ) that is typically smooth and differentiable. Our goal is to solve the problem $\lambda^* = \arg\min_{\lambda} f(\lambda)$


23 Convex Loss Functions [Figure from Hastie, Tibshirani, Friedman, Stat. Learning, Section 10.6 "Loss Functions and Robustness", p. 347. Vertical axis: Loss; horizontal axis: y f. Curves: Misclassification, Exponential, Binomial Deviance, Squared Error, Support Vector.] FIGURE: Loss functions for two-class classification. The response is $y = \pm 1$; the prediction is f, with class prediction sign(f). The losses are misclassification: $I(\mathrm{sign}(f) \neq y)$; exponential: $\exp(-yf)$; binomial deviance: $\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).
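
The five losses in the caption can be written as functions of the margin m = y f. This is a minimal sketch with some assumptions stated plainly: squared error is rewritten as (1 - m)^2, which equals (y - f)^2 when y is +1 or -1; binomial deviance is divided by log 2 so that it passes through (0, 1); and the misclassification loss is taken to be 0 at m = 0, a convention the caption does not fix. The function names are illustrative.

```python
import math


def misclassification(m):
    """1 when the sign of f disagrees with y (negative margin), else 0.
    Assumption: a margin of exactly 0 is counted as correct."""
    return 1.0 if m < 0 else 0.0


def exponential(m):
    return math.exp(-m)


def binomial_deviance(m):
    # Divided by log(2) so the curve passes through (0, 1).
    return math.log(1.0 + math.exp(-2.0 * m)) / math.log(2.0)


def squared_error(m):
    # (y - f)^2 = (1 - y f)^2 because y^2 = 1 for y in {-1, +1}.
    return (1.0 - m) ** 2


def hinge(m):
    """Support vector loss (1 - y f)_+."""
    return max(0.0, 1.0 - m)


losses = [misclassification, exponential, binomial_deviance, squared_error, hinge]
for m in (-2.0, -1.0, 0.0, 1.0, 2.0):
    row = "  ".join(f"{loss(m):7.3f}" for loss in losses)
    print(f"margin {m:5.1f}: {row}")
```

Printing the table shows one design difference between the curves: squared error grows again for margins above 1, so it penalizes predictions that are confidently correct, while the hinge loss is exactly zero there.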

24 Projections and Optimization Solve: $\min_{\lambda} E(x - \lambda)^2$. Lemma: $\lambda^* = E(x)$. Proof: $E(x - E(x) + E(x) - \lambda)^2 = E(x - E(x))^2 + E(E(x) - \lambda)^2 + 2E(x - E(x))E(E(x) - \lambda)$. Complete the proof!
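
One way the remaining steps of the proof might go is sketched below. It relies only on two facts stated earlier in the lecture: the expected deviation around the mean is zero, and E(x) - λ is a constant, so its expectation is itself.

```latex
\begin{align*}
E(x - \lambda)^2
  &= E\bigl(x - E(x)\bigr)^2 + \bigl(E(x) - \lambda\bigr)^2
     + 2\,E\bigl(x - E(x)\bigr)\bigl(E(x) - \lambda\bigr) \\
  &= \operatorname{Var}(x) + \bigl(E(x) - \lambda\bigr)^2
     && \text{since } E\bigl(x - E(x)\bigr) = 0 \\
  &\ge \operatorname{Var}(x),
     && \text{with equality if and only if } \lambda = E(x).
\end{align*}
```

The minimum value of the objective is therefore the variance, attained at the mean, which matches the geometric picture of the mean as a projection and the variance as the squared length of the error.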

25 Sample vs. Population Statistics Sample statistics refers to properties computed from the data. Population statistics refers to the properties of the underlying distribution. Suppose we are given n samples $x_1, \ldots, x_n$ of a random variable X. Sample mean $= \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. Sample variance $= s_{xx} = \frac{1}{n}\sum_{j=1}^{n}(x_j - \bar{x})^2$.
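
The two sample statistics can be computed directly from their definitions. The sketch below uses a small made-up sample (an assumption, not data from the lecture) and compares the result with Python's standard `statistics` module. The slide's variance divides by n, which corresponds to `statistics.pvariance`, whereas `statistics.variance` divides by n - 1.

```python
import statistics

# Assumed toy sample, chosen so the arithmetic is easy to follow.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(samples)

# Sample mean: (1/n) * sum of x_i.
x_bar = sum(samples) / n

# Sample variance as defined on the slide: (1/n) * sum of (x_j - x_bar)^2.
s_xx = sum((x - x_bar) ** 2 for x in samples) / n

print("sample mean          =", x_bar)
print("sample variance (1/n) =", s_xx)
print("statistics.pvariance  =", statistics.pvariance(samples))
print("statistics.variance   =", statistics.variance(samples))  # uses n - 1
```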

26 Projections in Hilbert Spaces Hilbert spaces are commonly used in machine learning and engineering. They are vector spaces that may be infinite-dimensional. The concept of projection is defined in Hilbert space through the notion of an inner product. This generalizes the dot product in finite dimensions.

27 Inner Products Given two vectors x, y in a Hilbert space H, the inner product is denoted <x, y>. It satisfies a few key properties. Nonnegativity: <x, x> >= 0 (if 0, x must be the 0 vector). Linearity: <x + z, y> = <x, y> + <z, y>. Scalar multiple: <a x, y> = a <x, y>. Symmetry: <x, y> = <y, x>.

28 Defining Distance in Hilbert Spaces Norm: $\|x\| = \sqrt{\langle x, x \rangle}$. Hilbert space of probability distributions: $\langle p(x), q(x) \rangle = E(p(x)q(x)) = \int p(x)\,q(x)\,f_X(x)\,dx$
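
The expectation-based inner product can be illustrated with a discrete random variable, where the integral becomes a sum. This is a minimal sketch under that assumption; the probability table `f` and the helper names `inner` and `norm` are hypothetical and chosen only for illustration. It also shows that projecting the function p(x) = x onto the constant function 1 recovers the mean E(X), mirroring the finite-dimensional picture from earlier slides.

```python
import math


def inner(p, q, f):
    """<p, q> = E[p(X) q(X)] = sum over x of p(x) q(x) f(x), for discrete X."""
    return sum(p(x) * q(x) * fx for x, fx in f.items())


def norm(p, f):
    """Norm induced by the inner product: sqrt(<p, p>)."""
    return math.sqrt(inner(p, p, f))


# Hypothetical probability mass function for X (values chosen for easy arithmetic).
f = {0: 0.25, 1: 0.5, 2: 0.25}


def identity(x):
    return float(x)


def one(x):
    return 1.0


# Projection coefficient of the identity function onto the constant function 1.
coefficient = inner(identity, one, f) / inner(one, one, f)
mean = sum(x * fx for x, fx in f.items())

print("<x, 1>      =", inner(identity, one, f))
print("||1||       =", norm(one, f))
print("coefficient =", coefficient, "  E(X) =", mean)
```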

29 Summary Datasets used in machine learning are highly diverse. Unsupervised learning tries to generate a compressed version of the original data. Projections are a fundamental way to find compressed representations. Hilbert spaces are vector spaces that may be infinite-dimensional. Deep relations between projections, statistics, and optimization form the basis for much work in machine learning.
