From Non-Negative Matrix Factorization to Deep Learning


1 The Math!! From Non-Negative Matrix Factorization to Deep Learning Intuitions and some Math too! October 18, 2017

2 The Math!! Introduction

3 Disclaimer Introduction The Math!! I really just wanted to talk about NMF. I added Deep Learning to the title just to attract people. Conclusion: Deep Learning actually works!

4 Relational Data Introduction The Math!! We are given data about two sets of entities: X = {x_1, x_2, x_3, ..., x_m} and Y = {y_1, y_2, y_3, ..., y_n}, which may co-occur under some observable event. We are also given the number of times c (a count, c ∈ ℕ₀ = {0, 1, 2, ...}) that x_i and y_j co-occurred under such a condition: (x_i, y_j, c_ij)

5 The Math!! Relational Data - (x_i, y_j, c_ij) Typical cases: User x_i watched Movie y_j (c = 0/1); Customer x_i purchased Product y_j c times; Web User x_i clicked on Ad/Page y_j (c = 0/1); Web User x_i searched for Query y_j c times. But there are also cases where X and Y are the same type (or even the same set) of entities: word x_i occurs with word y_j c times; protein x_i interacts with protein y_j (c = 0/1)

6 The Math!! Relational Data - Representation We can consider (x_i, y_j, c_ij) as a list of tuples. Or, given an appropriate indexing of the entities (X: x_i with i = 1, 2, 3, ..., m; Y: y_j with j = 1, 2, 3, ..., n), we can represent all the observations as a matrix C, with rows indexed by X and columns by Y:

C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mn} \end{pmatrix}
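
A minimal sketch of that mapping (the tuples and dimensions below are made up purely for illustration):

```python
import numpy as np

# Hypothetical observations: (user index, movie index, count)
observations = [(0, 1, 3), (0, 4, 1), (2, 1, 2), (3, 0, 5)]

m, n = 4, 5                      # m users (rows), n movies (columns)
C = np.zeros((m, n))             # mostly zeros: most possible events never happen
for i, j, c in observations:
    C[i, j] = c                  # place each count at row i, column j

print(C)
```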

7 The Math!! Relational Data - Matrix Representation A few (practical) observations about C: it contains count data (non-negative!); it probably contains many zeros, since most possible events do not happen (count = 0), which makes it a not very efficient representation; and it is probably a very large matrix: hundreds × hundreds (e.g. countries × countries), or millions × millions (e.g. Amazon users × products). Interestingly, under some conditions, C can still be used to make recommendations / predictions

8 The Math!! Users who did this also did that Classic way of making predictions using this sort of data. Each user is represented by a row vector from the matrix C: x_1 = [c_11, c_12, ..., c_1n], ..., x_i = [c_i1, c_i2, ..., c_in], ..., x_m = [c_m1, c_m2, ..., c_mn]. We can perform an all-vs-all comparison among the x_i vectors using some convenient vector similarity/distance metric (e.g. cosine). For each user x_i we build the list of the top k most similar users: x_i → S_i = [x_a, x_b, x_c, ...]. Recommendation Strategy: for each user x_i, recommend items (e.g. movies, products) that the most similar users x_j ∈ S_i have already interacted with, but which x_i has not yet interacted with
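
A minimal sketch of this neighbour-based strategy (reusing a small hypothetical C; the choice of k is arbitrary): normalize the rows, take all-vs-all dot products, and keep the k most similar users for each user.

```python
import numpy as np

def top_k_similar_users(C, k=2):
    # Cosine similarity: normalize each row, then all-vs-all dot products.
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    X = C / np.maximum(norms, 1e-12)        # avoid division by zero for empty rows
    S = X @ X.T                             # S[i, j] = cosine(x_i, x_j)
    np.fill_diagonal(S, -np.inf)            # a user is not their own neighbour
    return np.argsort(-S, axis=1)[:, :k]    # indices of the k most similar users

C = np.array([[3., 0., 1., 0.],
              [2., 0., 1., 0.],
              [0., 5., 0., 2.]])
print(top_k_similar_users(C, k=1))          # users 0 and 1 pick each other
```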

9 The Math!! Users who did this also did that Easy to understand, but it breaks on high dimensional data. If the space of items is very large, then: the probability of overlap between any two vectors is very low; the regions of overlap tend to be mostly the same across users; and a spurious/meaningless co-occurrence tends to be over-emphasized. Think of the Amazon catalog (millions of products): lots of people buy the best-selling dish soap, so that is not a meaningful overlap; and the fact that two people bought the same rare French movie DVD does not mean that they should use the same shampoo

10 A very shallow story Introduction The Math!! One important semantic observation: matrix C tells a very shallow story. C merely stores observations: it stores a flat relation (e.g. user x_147 watched movie y_1023). C tells nothing about the process underlying the observations. Therefore, C can't explain the data in a compact way, nor model the underlying data generation process. But we know that there is a process: there is a chain of events that leads to the observations

11 We know that Introduction The Math!! Movies can be categorized into certain genres g_1, g_2, ..., g_k: comedy, action, horror, sci-fi, etc. A movie can belong to several genres, and we can assume there is a degree of belongingness. There are far fewer genres (10s) than movies (1000s). Users have different preferences for / affinities to those genres. Users typically choose a movie of the genres they like, even if within the genre their choice is rather random

12 The Math!! Any resemblance with reality is pure ML modeling So, here's a slightly more reasonable story: 1. user x_i has a certain affinity to each movie genre g_j; 2. user x_i decides to watch a movie of a specific genre g_k (depending on the affinity he/she has to each genre); 3. from all movies y_j of genre g_k the user picks one (with a probability depending on the affinity of that movie to that genre). Instead of the flat "user x watched movie y" relation, we now have two relations: user × genres (preferences / affinities) and movies × genres (belongingness). And the story goes like: observations: user → genres → movies
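
To make the chain user → genre → movie concrete, here is a minimal generative sketch; the affinity tables below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical affinities of one user to k = 3 genres, and of 4 movies to those genres.
user_affinity = np.array([0.7, 0.2, 0.1])          # how much the user likes each genre
movie_genre = np.array([[0.9, 0.1, 0.0],           # movie 0 is mostly genre 0
                        [0.0, 0.8, 0.2],
                        [0.1, 0.1, 0.8],
                        [0.5, 0.5, 0.0]])

# 1. pick a genre according to the user's affinities
genre = rng.choice(3, p=user_affinity / user_affinity.sum())
# 2. pick a movie according to how much each movie belongs to that genre
p_movie = movie_genre[:, genre] / movie_genre[:, genre].sum()
movie = rng.choice(4, p=p_movie)
print(f"user watches movie {movie} (via genre {genre})")
```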

13 The Math!! Oh wait!! Did you say relation? But relations can be stored in matrices! So, in this model we have two additional matrices:

U = \begin{pmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{m1} & \cdots & u_{mk} \end{pmatrix} \ (m \text{ users} \times k \text{ genres}) \qquad M = \begin{pmatrix} m_{11} & \cdots & m_{1k} \\ m_{21} & \cdots & m_{2k} \\ \vdots & & \vdots \\ m_{n1} & \cdots & m_{nk} \end{pmatrix} \ (n \text{ movies} \times k \text{ genres})

Both U and M are very thin and tall matrices. If k is the number of genres: k ≪ m (nr of users) and k ≪ n (nr of movies)

14 The Math!! Putting all the pieces together So, our story was: observations: user → genres → movies

15 The Math!! Putting all the pieces together So, our story was: observations: (user → genres) → movies. This first part is matrix U (m users × k genres):

U = \begin{pmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{m1} & \cdots & u_{mk} \end{pmatrix}

16 The Math!! Putting all the pieces together So, our story was: observations: user → (genres → movies). This second part is not matrix M (n movies × k genres); it is its transpose, M^T:

M = \begin{pmatrix} m_{11} & \cdots & m_{1k} \\ m_{21} & \cdots & m_{2k} \\ \vdots & & \vdots \\ m_{n1} & \cdots & m_{nk} \end{pmatrix} \qquad M^T = \begin{pmatrix} m_{11} & m_{21} & \cdots & m_{n1} \\ \vdots & & & \vdots \\ m_{1k} & m_{2k} & \cdots & m_{nk} \end{pmatrix}

M^T is a short and long matrix (k genres × n movies: many columns!)

17 The Math!! Putting all the pieces together Then, if we assume linearity, our story (observations: user → genres → movies) can be mapped to C ≈ U M^T:

\begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & & & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mn} \end{pmatrix} \approx \begin{pmatrix} u_{11} & \cdots & u_{1k} \\ u_{21} & \cdots & u_{2k} \\ \vdots & & \vdots \\ u_{m1} & \cdots & u_{mk} \end{pmatrix} \begin{pmatrix} m_{11} & m_{21} & \cdots & m_{n1} \\ \vdots & & & \vdots \\ m_{1k} & m_{2k} & \cdots & m_{nk} \end{pmatrix}

(m users × n movies) ≈ (m users × k genres) × (k genres × n movies), i.e. (m × n) ≈ (m × k)(k × n)

18 The Math!! Let's dive deeper on C ≈ U M^T (the same matrices as on the previous slide). Looking at one entry: c_12 is how user 1 relates to movie 2 (count); u_1[1..k] is the vector describing the affinity of user 1 to the k genres; m_2[1..k] is the vector describing the affinity of movie 2 to the k genres; and c_12 ≈ u_1[1..k] · m_2[1..k]. Linear model: looks great! The higher the overlap between genre affinities, the higher the count
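
A tiny numeric check of this dot-product view (the two affinity vectors below are made up):

```python
import numpy as np

u_1 = np.array([0.9, 0.1, 0.0])   # hypothetical affinity of user 1 to k = 3 genres
m_2 = np.array([0.8, 0.2, 0.0])   # hypothetical affinity of movie 2 to the same genres

# The linear model predicts the count from the overlap of the two genre profiles.
c_12 = u_1 @ m_2
print(c_12)   # 0.74: the large overlap in genre 0 drives the predicted count up
```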

19 The Math!! Here is something interesting. Counting the entries of the factor matrices U (m users × k genres) and M^T (k genres × n movies): nr of parameters in U & M^T: m·k + k·n = (m + n)·k; nr of parameters in C: m·n

20 The Math!! In practice k ≪ m and k ≪ n, so both factors (m users × k genres and k genres × n movies) are tiny compared to C. So: (m + n)·k ≪ m·n. E.g.: 100k users, 10k movies, 200 genres. C: 100k × 10k = 1,000M cells / params. U & M^T: (100k + 10k) × 200 = 22M params, roughly a 45× compression
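
The slide's arithmetic as a two-line sanity check (same hypothetical sizes: 100k users, 10k movies, 200 genres):

```python
m, n, k = 100_000, 10_000, 200
print(f"C has {m * n / 1e6:.0f}M cells, U & M^T have {(m + n) * k / 1e6:.0f}M params")
print(f"compression factor: {m * n / ((m + n) * k):.1f}x")   # ~45x
```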

21 Less is More Introduction The Math!! We are explaining m·n observations using only (m + n)·k params. This is good: we have a compact / lower dimensional model! We are concentrating the information in fewer parameters. Our model for recommendation is now based on a deeper story! The link between users and movies is made via genres. This helps solve the problems related to sparsity! We can still try to find the most similar users, but instead of comparing the sparse vectors from C (n dims) we compare the lower dimensional dense vectors from U (k dims)

22 Less is More (?) Introduction The Math!! Except we have a little problem!

23 Less is More (?) Introduction The Math!! In fact, we have two little problems

24 The Math!! Problem #1: List of Genres For most things we may not have a list of existing genres. Maybe for movies we have some sets, but even in that case: it is not obvious how to define what those are (by committee); there are new genres coming in (e.g. zombie); and definitions may change over time (concept drift, specialization). It is also not trivial to assign items (users and movies) to genres: typically categories are subject to multiple interpretations, and we may not even have the information to make the assignment. How would we know in advance the profile of each user?

25 The Math!! Problem #2: How many Genres are enough? This is a more technical issue, but the intuition is easy. Extreme case: we have only 1 defined movie genre; movies are either of that genre or not (plus undefinable movies), and users either have affinity to that genre or not. Then we can only explain very simple observation patterns (essentially a single rank-1 block in the observation matrix)

26 Adding Genres Introduction The Math!! We can add genres (and continuous levels of belongingness): adding just one more genre already provides more explanatory power. The observation matrix can now show a more complex pattern

27 Adding Genres Introduction The Math!! We can add even more genres, and this allows us to explain even more complex patterns. In fact, if we add enough genres, we are guaranteed to find a factorization that exactly explains the observation matrix. For example, assume the observation matrix is a small matrix of counts in which two of the rows happen to be equal (the concrete matrix is shown on the slide):

28 The Math!! Here's an obvious factorization, with k = number of rows: the left factor is an identity-like matrix that works as a row selector, and the right factor simply replicates the observation matrix. This is not an interesting factorization; we are using it just to illustrate an intuition. But notice that two of the rows are equal

29 The Math!! So, we don't actually need so many genres With k = 3: the left factor still works as a row selector, while the right factor now holds only the distinct rows of the observation matrix; we only need to choose among 3 distinct rows. Row-oriented perspective: the matrix on the right is a basis of rows that can be composed / selected to build each row of the data matrix

30 The Math!! So, we don't actually need so many genres (II) With k = 3 we can also read the factorization the other way: the left factor is a basis of columns to be composed, and the right factor holds the coefficients used to compose the output column vectors; we are now composing 3 distinct columns. Column-oriented perspective: the matrix on the left is a basis of columns that can be composed / selected to build each column of the data matrix

31 More generally Introduction The Math!! In fact, if we set the nr of genres k to k ≥ min(m, n), with m the number of rows / users and n the number of columns / movies, we are guaranteed to find a factorization that fully explains the observations, even if that factorization is pretty boring

32 And? Introduction The Math!! But the really interesting cases happen when k ≪ min(m, n). In such cases, we may not be able to fully recover the observation matrix, but we may still be able to recover a close enough one. In this case, we get a compact representation of users and movies that does not suffer from the sparsity problems

33 The Math!! So, how do we make this happen? If we don't have a list of genres, if we don't even know how many genres we need, if we just have a matrix with observations: how do we make this happen???

34 The Math!! The Math!!

35 Intro and Terminology Introduction The Math!! I am going to jump directly into the math; we already went through the intuitions. I am going to follow the notation used in Wikipedia 1. This will make any further study using Wikipedia easier (and makes my life much easier too!). So: observation matrix C → V, user matrix U → W, movie matrix M → H. Let's forget the user × movies scenario to think abstractly, but still keep some intuitions in mind. 1 https://en.wikipedia.org/wiki/Non-negative_matrix_factorization

36 The Math!! Some notation and conventions Historically, NMF formulations follow the column oriented perspective: V is the data matrix, W is the dictionary / features / basis matrix, and H is the activation / coefficients matrix. We build each column of V by using the k coefficients in the corresponding column of H to combine the k available columns of W. And we want, for a given k (number of categories): V ≈ WH

37 V ≈ WH Introduction The Math!! For a given k, there is a residual: R_k = V − WH, which is a matrix of the same dimensions as V. We wish the residual to be the smallest possible under a given matrix norm:

\min_{W,H} \|R_k\| = \min_{W,H} \|V - WH\| \quad \text{subject to } W \ge 0,\ H \ge 0

For now, the only restrictions on W and H are that all elements are non-negative and that their shapes are m × k and k × n, but we could impose more restrictions (e.g. W^T W = I)

38 The Math!! Choosing a Norm to Measure the Residual Matrices are complex objects, so there are many possible ways of sizing them through various norms. Since there are many matrix norms, there are many ways for us to measure the size of the residual matrix R_k = V − WH. The norm we choose will define how we weigh the errors of the approximation WH of V. And we can have several assumptions about how errors should be measured: Is having many small errors better than having one large error? Do errors become disproportionately worse with amplitude? Do we care about how many errors we make, or only about their amplitudes? This leads to many interesting (and fundamental) questions, but I am going to skip all this!

39 Frobenius Norm Introduction The Math!! A very intuitive norm is the Frobenius Norm:

\|R\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij}^2}

It is simply the square root of the sum of the squares of all cells in the matrix. Two other formulations: the matrix-operator formulation, \|R\|_F = \sqrt{\operatorname{trace}(R^T R)}; and, if \sigma_i(R) are the singular values of R, \|R\|_F = \sqrt{\sum_{i=1}^{\min\{m,n\}} \sigma_i^2(R)}. \|R\|_F only vanishes if all r_{ij} = 0, that is, when V = WH
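
A quick numeric check that the three formulations agree (random matrix, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(size=(4, 6))

entrywise = np.sqrt((R ** 2).sum())                  # sqrt of the sum of squared cells
via_trace = np.sqrt(np.trace(R.T @ R))               # sqrt(trace(R^T R))
via_svd = np.sqrt((np.linalg.svd(R, compute_uv=False) ** 2).sum())  # sqrt(sum of sigma_i^2)

print(entrywise, via_trace, via_svd)                 # all three agree up to float error
```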

40 But Introduction The Math!! The square root usually leads to nasty calculus. Much more convenient is the square of the Frobenius Norm:

\|R\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij}^2

We got rid of the square root, but the minimizers of \|R\|_F^2 and \|R\|_F are still the same. Remember that we want to minimize \min_{W,H} \|R_k\|. The derivatives of \|R\|_F^2 only involve differentiating powers of 2: we will need to take derivatives to compute the minimizers, and these are easy!

41 The Math!! Let's get back to our minimization problem We want to minimize (now using the square of the Frobenius norm):

\min_{W,H} \|R\|_F^2 = \min_{W,H} \|V - WH\|_F^2 \quad \text{subject to } W \ge 0,\ H \ge 0

We have two unknowns, W and H. So we need to find two variables, but we only have one equation. There is not even a unique solution: WH = (WB)(B^{-1}H) for any invertible B, which means that any (complementary) permutation or scaling of W and H will lead to the same result
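
A two-line illustration of that non-uniqueness, with B chosen as a diagonal rescaling (one choice of invertible B that also keeps both factors non-negative):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.random((5, 3))
H = rng.random((3, 4))
B = np.diag([2.0, 0.5, 3.0])                      # invertible rescaling of the k factors

same = np.allclose(W @ H, (W @ B) @ (np.linalg.inv(B) @ H))
print(same)                                       # True: (WB)(B^-1 H) gives the same product
```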

42 The Math!! Let's get back to our minimization problem We want to minimize (now using the square of the Frobenius norm):

\min_{W,H} \|R\|_F^2 = \min_{W,H} \|V - WH\|_F^2

Interestingly, if we only had one unknown (e.g. having W and trying to find H, or having H and trying to find W), we would fall into a very well-known case: Ordinary Least Squares (Linear Regression with squared loss), for which there is a closed form 2 solution! 2 I am not showing the derivation here, but this closed form solution is possible because we have an easy way of computing the derivatives

43 The Math!! Ordinary Least Squares (OLS) Suppose that we knew W and wanted to find the H that minimizes the error:

\min_{H} \|R\|_F^2 = \min_{H} \|V - WH\|_F^2

The OLS solution for H would be:

H = (W^T W)^{-1} W^T V

with (W^T W)^{-1} W^T being the left pseudo-inverse of W (Moore-Penrose inverse)

44 The Math!! Ordinary Least Squares (OLS) Suppose now that we knew H and wanted to find the W that minimizes the error:

\min_{W} \|R\|_F^2 = \min_{W} \|V - WH\|_F^2

The OLS solution for W would be:

W^T = (H H^T)^{-1} H V^T \quad \text{(equivalently, } W = V H^T (H H^T)^{-1}\text{)}

again with H^T (H H^T)^{-1} being the (right) pseudo-inverse of H (Moore-Penrose inverse)
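
Both closed-form updates as a small sketch (shapes follow the slides: V is m × n, W is m × k, H is k × n; np.linalg.solve is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

def ols_H(V, W):
    # H = (W^T W)^-1 W^T V, solved as a k x k linear system
    return np.linalg.solve(W.T @ W, W.T @ V)

def ols_W(V, H):
    # W = V H^T (H H^T)^-1, i.e. W^T = (H H^T)^-1 H V^T
    return np.linalg.solve(H @ H.T, H @ V.T).T

rng = np.random.default_rng(3)
W_true, H_true = rng.random((6, 2)), rng.random((2, 5))
V = W_true @ H_true
print(np.allclose(ols_H(V, W_true), H_true))   # True: knowing W recovers H exactly
```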

45 The Math!! Ordinary Least Squares (OLS) This looks very complicated, but it is just: matrix transposition, matrix multiplication, and matrix inversion of (small) k × k matrices: W is m × k so W^T W is k × k, and H is k × n so H H^T is k × k. It's just a shame that we don't have either of the W or H matrices! But as with many things in life...

46 The Math!! We can fake it until we make it! And that's ok! No need to feel embarrassed about it!

47 The Math!! Alternating Least Squares (ALS) This is an algorithm from a larger class of algorithms that work by making leaps of faith. We are going to assume that we know one of the matrices and compute the other using OLS. ALS Algorithm: generate a random W_0 with positive entries; repeat for n = 1, 2, ...: 1. compute H_n = OLS(V, W_{n-1}); 2. correct negative entries in H_n; 3. compute W_n = OLS(V, H_n); 4. correct negative entries in W_n; until \|R\|_F^2 is small enough or the max number of iterations is reached
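
A minimal sketch of this ALS loop in plain NumPy, using the two OLS updates above and clipping to enforce non-negativity; the tolerance, iteration count, and tiny ridge term are my own choices, not from the talk:

```python
import numpy as np

def als_nmf(V, k, n_iters=200, tol=1e-9, seed=0):
    """Plain ALS for V ~ W @ H, with non-negativity forced by clipping."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 0.1                        # random positive start (W_0)
    ridge = 1e-9 * np.eye(k)                            # keeps the k x k solves well-posed
    for _ in range(n_iters):
        H = np.linalg.solve(W.T @ W + ridge, W.T @ V)   # OLS step for H
        H = np.clip(H, 0.0, None)                       # correct negative entries
        W = np.linalg.solve(H @ H.T + ridge, H @ V.T).T # OLS step for W
        W = np.clip(W, 0.0, None)                       # correct negative entries
        if np.linalg.norm(V - W @ H) ** 2 < tol:        # stop when the residual is small enough
            break
    return W, H

# Toy check: a non-negative matrix that is exactly rank 2, factorized with k = 2.
rng = np.random.default_rng(4)
V = rng.random((8, 2)) @ rng.random((2, 10))
W, H = als_nmf(V, k=2)
print(np.linalg.norm(V - W @ H) ** 2)                   # squared residual (should be small)
```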

48 The Math!! Alternating Least Squares (ALS) It's very fast (+): the heavy operation would be the inversion, but we are inverting only k × k matrices. It is easy to understand and to implement (+). It works well in practice (+), i.e. it finds one solution to the optimization problem. It does not ensure non-negativity (-): we have to force a correction. It does not allow control over the structure of W and H (-): sparseness, orthogonality

49 The Math!! So, how do we do it better? There are other ways of computing NMF. We can formulate the problem with regularization constraints (sparseness, orthogonality, dependencies, blocks); use many well known Gradient Descent strategies; use better ways of performing updates (multiplicative updates); and scale to big data using stochastic approaches. There are smart ways of performing the computations on GPUs; there are probabilistic interpretations of NMF; there are ways of dealing with missing data; etc. There are 30 years or more of research in NMF; we are not going to talk about all of it now

50 The Math!!

51 Key Points Introduction The Math!! NMF is a powerful modeling tool. It transforms a set of X, Y observations stored in a matrix into a story. The story has a new set of k entities in between X and Y: Users - Movies becomes Users - Genres - Movies. This implies factorizing the observation matrix into two matrices: one matrix stores the information relating X and the k entities, the other stores the information relating Y and the k entities. We showed that we can compute the factorization; ALS is just one of the ways of performing such a factorization

52 Key Points Introduction The Math!! By factorizing we build a story that is closer to reality: observations do not just happen. Our model now includes an underlying (yet unobserved) set of factors that helps explain the observations (e.g. the genres). This means our predictions are not dependent on specific observations: we can ground them on these new unobserved parameters; they are more explanatory and more stable (changing one observation does not change the underlying parameters too much). So, this is good!

53 So Introduction The Math!! Can we build models that involve even deeper concepts and more intricate relations between them? Yes. For example, we can add more hierarchy: Users → Groups of Users → Genres → Groups of Movies → Movies. This would amount to decomposing C into 4 matrices! We are going deeper!

54 Is this Deep Learning? Introduction The Math!! Is this Deep Learning? Yes and no, depending on who you ask! But there is a deep 3 connection between (Non-negative) Matrix Factorization and Deep Learning / Multi-Layer Perceptron architectures: they both try to add explanatory entities in between the inputs and the outputs! 3 No pun intended!

55 However Introduction The Math!! That's for another talk!

56 And now it's time for Introduction The Math!! Questions!

57 The Math!! Thank you! Connect with me on LinkedIn
