Lecture 2: Linear Algebra Review


CS 4980/6980: Introduction to Data Science                                        Spring 2018

                            Lecture 2: Linear Algebra Review

Instructor: Daniel L. Pimentel-Alarcón          Scribed by: Anh Nguyen and Kira Jordan

This is preliminary work and has not been reviewed by the instructor. If you have comments about typos, errors, notation inconsistencies, etc., please email Anh Nguyen and Kira Jordan at anguyen139@student.gsu.edu and kjordan32@student.gsu.edu.

2.1 Lecture One: Intro to Data Science Recap

Data Science combines the disciplines of mathematics and computer science to gain meaningful insight from data sets. We often use tables to organize the data we are studying. Each row represents a data feature (ranging from 1 to D) and each column represents a sample vector (ranging from 1 to N).

                 Sample 1    Sample 2    Sample 3    ...    Sample N
    Feature 1
    Feature 2
    Feature 3
       ...
    Feature D

                            Table 2.1: Data Table Layout

We normally use X to denote a D x N data matrix, and use x_i to denote the i-th column vector in X. We write x_i ∈ R^D to indicate that x_i is a vector with D real-valued entries.
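To make this notation concrete, the sketch below builds a toy D x N data matrix in NumPy and pulls out one column vector x_i. The feature values are made-up illustration numbers, not data from the lecture:

```python
import numpy as np

# A toy D x N data matrix: D = 3 features (rows), N = 4 samples (columns).
# The numbers are arbitrary illustration values.
X = np.array([[5.1, 4.9, 6.2, 5.8],    # Feature 1 across the 4 samples
              [3.5, 3.0, 2.9, 2.7],    # Feature 2
              [1.4, 1.4, 4.3, 5.1]])   # Feature 3

D, N = X.shape                 # D = 3, N = 4

x_2 = X[:, 1]                  # the 2nd column vector (0-based index 1), x_2 in R^D
print(x_2)                     # [4.9 3.  1.4]
print(x_2.shape)               # (3,) -- a vector with D real-valued entries
```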

2.2 Data Models

To compose a complete data model, we must select the most appropriate model (line, bar graph, etc.) and model learning process. Determining the best model and learning process depends on the data that we are analyzing. There are two types of model learning processes: supervised learning and unsupervised learning.

(a) Supervised Learning

    (i) Uses labels (categorical data).

    (ii) Example: Multiple photographs are labeled by the subject's name. Our goal is to be able to identify the proper label for any new photograph.

        Figure 2.1: The photos are our data, and the subject names are the labels.

    (iii) Other applications include identifying dog breed based on images, diagnosing diseases based on MRI scans, etc.

    (iv) Algorithms for Supervised Learning:
         - Logistic, Linear, and Polynomial Regression
         - Support Vector Machine (SVM)
         - K Nearest Neighbors
         - Random Forest
         - Neural Networks
         - Decision Trees
         - Naive Bayes

(b) Unsupervised Learning

    (i) Does not use labels (data is uncategorized).

    (ii) Example: We are given numerous shuffled images and we do not know who any of the subjects are. Our goal is to be able to identify that the 1st picture and the 7th picture are the same subject.

        Figure 2.2: We can see that the far left image is the same as the image second from the right.

    (iii) Another application for unsupervised learning is video surveillance.

    (iv) Algorithms for Unsupervised Learning:

         - K Means Clustering
         - Hierarchical Clustering
         - Expectation Maximization (EM)
         - Principal Component Analysis (PCA)
         - Correlation Analysis

2.3 Linear Algebra: Fundamentals

(a) A vector is a single column which contains D features. A vector is denoted with a bold, lowercase letter (e.g. x). A non-bolded lowercase letter denotes a scalar value. Below is the standard notation for the i-th vector with five dimensions (D = 5):

        x_i = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}

(b) A matrix is a collection of N vectors. A matrix is denoted with a bold, uppercase letter (e.g. X). The element in a matrix's j-th row and i-th column is denoted x_{j,i}. Below is the standard notation for a 4 by 5 matrix (four representing the number of features and five representing the number of vectors):

        X = \begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & x_{1,4} & x_{1,5} \\ x_{2,1} & x_{2,2} & x_{2,3} & x_{2,4} & x_{2,5} \\ x_{3,1} & x_{3,2} & x_{3,3} & x_{3,4} & x_{3,5} \\ x_{4,1} & x_{4,2} & x_{4,3} & x_{4,4} & x_{4,5} \end{bmatrix}

2.4 Linear Algebra: Operations

(a) To transpose a vector or a matrix means to switch its rows and columns. We use x^T or X^T to denote the transpose operation.

        x = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix},  x^T = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}

        X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix},  X^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}

(b) The trace of a matrix is the sum of its diagonal elements (beginning at the top-left element and going down to the bottom-right). This operation is only applicable to square matrices. The trace of a matrix is written as

        tr(X) = \sum_{i=1}^{n} x_{ii} = x_{11} + x_{22} + \ldots + x_{nn}

    Example:

        X = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \\ 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 \end{bmatrix},  tr(X) = \sum_{i=1}^{4} x_{ii} = 1 + 6 + 11 + 16 = 34

(c) A square matrix X is said to be invertible if there exists a matrix X^{-1} such that

        X X^{-1} = I

    To invert a 2 x 2 matrix X, swap the positions of x_{1,1} and x_{2,2}, negate x_{1,2} and x_{2,1}, and divide by the determinant, which is computed as x_{1,1} x_{2,2} - x_{1,2} x_{2,1}:

        X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix},  X^{-1} = \frac{1}{\det(X)} \begin{bmatrix} x_{22} & -x_{12} \\ -x_{21} & x_{11} \end{bmatrix}

(d) We are able to multiply vectors and matrices by each other or by scalar values.

    (i) Vectors: We multiply two vectors of the same size to get the inner product (or dot product). Given two vectors x and y, we multiply y by x^T to get the value of x \cdot y:

        x = \begin{bmatrix} 4 \\ 7 \\ 2 \end{bmatrix},  y = \begin{bmatrix} 3 \\ 0 \\ 1 \end{bmatrix}

        x \cdot y = x^T y = \begin{bmatrix} 4 & 7 & 2 \end{bmatrix} \begin{bmatrix} 3 \\ 0 \\ 1 \end{bmatrix} = (4 \cdot 3) + (7 \cdot 0) + (2 \cdot 1) = 14

    In scalar multiplication, we multiply each element of a vector by the given scalar value to obtain a new vector:

        4 \begin{bmatrix} 0 \\ 6 \\ 2 \\ 4 \end{bmatrix} = \begin{bmatrix} 4 \cdot 0 \\ 4 \cdot 6 \\ 4 \cdot 2 \\ 4 \cdot 4 \end{bmatrix} = \begin{bmatrix} 0 \\ 24 \\ 8 \\ 16 \end{bmatrix}

    (ii) Matrices: It is only possible to multiply two matrices if the number of columns in the first matrix equals the number of rows in the second matrix.

        Not applicable for multiplication (a 2 x 4 matrix times a 3 x 6 matrix):

            \begin{bmatrix} 4 & 6 & 14 & 8 \\ 5 & 0 & 11 & 79 \end{bmatrix} \begin{bmatrix} 15 & 0 & 3 & 62 & 6 & 22 \\ 3 & 6 & 8 & 1 & 0 & 7 \\ 34 & 45 & 0 & 58 & 17 & 28 \end{bmatrix}

        Applicable for multiplication (a 2 x 2 matrix times a 2 x 3 matrix):

            \begin{bmatrix} 3 & 65 \\ 3 & 0 \end{bmatrix} \begin{bmatrix} 61 & 49 & 29 \\ 3 & 0 & 21 \end{bmatrix}

        Example:

            \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} 7 & 8 & 9 \\ 10 & 11 & 12 \end{bmatrix} = \begin{bmatrix} (1 \cdot 7 + 2 \cdot 10) & (1 \cdot 8 + 2 \cdot 11) & (1 \cdot 9 + 2 \cdot 12) \\ (3 \cdot 7 + 4 \cdot 10) & (3 \cdot 8 + 4 \cdot 11) & (3 \cdot 9 + 4 \cdot 12) \\ (5 \cdot 7 + 6 \cdot 10) & (5 \cdot 8 + 6 \cdot 11) & (5 \cdot 9 + 6 \cdot 12) \end{bmatrix} = \begin{bmatrix} 27 & 30 & 33 \\ 61 & 68 & 75 \\ 95 & 106 & 117 \end{bmatrix}

        Scalar multiplication on matrices works the same as it does on vectors: simply multiply each element of the matrix by the scalar value to obtain a new matrix.

        Example:

            2 \begin{bmatrix} 3 & 0 & 6 \\ 4 & 2 & 3 \end{bmatrix} = \begin{bmatrix} 2 \cdot 3 & 2 \cdot 0 & 2 \cdot 6 \\ 2 \cdot 4 & 2 \cdot 2 & 2 \cdot 3 \end{bmatrix} = \begin{bmatrix} 6 & 0 & 12 \\ 8 & 4 & 6 \end{bmatrix}

2.5 Linear Algebra: Norms

A vector's norm is used to quantify how big or small the vector is. There are different ways of defining a norm. The general form of the p-norm of a vector x ∈ R^D is

        \|x\|_p = \left( \sum_{j=1}^{D} |x_j|^p \right)^{1/p}

For p = 2, this is the Euclidean norm, the length of the vector x in R^D:

        \|x\|_2 = \sqrt{ \sum_{j=1}^{D} x_j^2 }

For p = 1, it is the sum of the absolute values of the entries; hence it is also called the Manhattan distance:

        \|x\|_1 = \sum_{j=1}^{D} |x_j|

Matrix norms:

    L2 (spectral) norm:  \|X\|_2 := the largest singular value of X

    Frobenius norm:  \|X\|_F := \sqrt{ \sum_{i=1}^{D} \sum_{j=1}^{N} x_{ij}^2 }

In data science, norms play a critical role in quantifying how different or how similar our data points are.
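The worked examples in Sections 2.4 and 2.5 can be reproduced with NumPy. The following is a small sanity-check sketch using the same numbers as above; the 2 x 2 inverse input and the norm inputs are extra made-up illustration values:

```python
import numpy as np

# Transpose (Section 2.4a)
X = np.array([[1, 2], [3, 4], [5, 6]])
print(X.T)                                    # rows become columns: [[1 3 5], [2 4 6]]

# Trace (Section 2.4b): sum of the diagonal of a square matrix
A = np.arange(1, 17).reshape(4, 4)            # the 4 x 4 matrix with entries 1..16
print(np.trace(A))                            # 1 + 6 + 11 + 16 = 34

# Inverse of a 2 x 2 matrix (Section 2.4c): X @ X_inv = I
B = np.array([[4.0, 7.0], [2.0, 6.0]])        # made-up invertible matrix, det = 10
B_inv = np.linalg.inv(B)
print(np.allclose(B @ B_inv, np.eye(2)))      # True

# Inner (dot) product of two vectors (Section 2.4d-i)
x = np.array([4, 7, 2])
y = np.array([3, 0, 1])
print(x @ y)                                  # 4*3 + 7*0 + 2*1 = 14

# Matrix multiplication (Section 2.4d-ii): (3 x 2) @ (2 x 3) -> 3 x 3
C = np.array([[7, 8, 9], [10, 11, 12]])
print(X @ C)                                  # [[27 30 33], [61 68 75], [95 106 117]]

# Scalar multiplication
print(2 * np.array([[3, 0, 6], [4, 2, 3]]))   # [[6 0 12], [8 4 6]]

# Norms (Section 2.5), on made-up inputs
v = np.array([3.0, -4.0])
print(np.linalg.norm(v, 2))                   # Euclidean norm: 5.0
print(np.linalg.norm(v, 1))                   # Manhattan norm: 7.0
M = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.linalg.norm(M, 2))                   # spectral norm: largest singular value of M
print(np.linalg.norm(M, 'fro'))               # Frobenius norm: sqrt(1 + 4 + 9 + 16)
```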

2.6 Optimization Review

Basic terms:

    x^* = \arg\max_x f(x) means: over all x in the domain of the function f, find the value x^* such that f(x^*) is the maximum.

    x^* = \arg\min_x f(x) means: over all x in the domain of the function f, find the value x^* such that f(x^*) is the minimum.

Case 1: Simple convex/concave function with one optimal solution

    Figure 2.3: The function y = x^2 has one minimum, at x = 0.

    Steps to find the optimal point of a convex/concave function:

    Step 1: Find df/dx.
    Step 2: Solve df/dx = 0.

Case 2: Multivariable convex function

    In this case, we find the minimum/maximum of f(X), where X is a matrix:

        X^* := \arg\max_{X \in \mathbb{R}^{D \times N}} f(X)

    Step 1: Calculate the gradient

        \nabla f(X) = \frac{\partial f}{\partial X} = \left\langle \ldots, \frac{\partial f}{\partial X_i}, \ldots \right\rangle    (2.1)

    Step 2: Set the gradient to zero and solve the resulting equation

        \nabla f(X) = 0    (2.2)

    The gradient of a multivariable function is complicated in general, but there are some useful special cases, such as (for a constant matrix A, with A symmetric in Eq. (2.4)):

        \nabla \, tr(AX) = A    (2.3)

        \nabla \, tr(X^T A X) = 2 X^T A    (2.4)
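As a quick numerical illustration of Eq. (2.3), the sketch below (shapes and random values are arbitrary choices, not from the notes) uses central finite differences on f(X) = tr(AX). Each entrywise derivative comes out as ∂f/∂X_{ij} = A_{ji}; arranged with the shape of X this is A^T, which is the identity in Eq. (2.3) written in the transposed layout used above:

```python
import numpy as np

# Central finite-difference check:  d tr(AX) / d X[i, j] = A[j, i].
# Shapes are arbitrary; A is 3 x 4 and X is 4 x 3 so that AX is square.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 3))

def f(X):
    return np.trace(A @ X)

eps = 1e-6
grad = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)   # central difference

# grad has the shape of X and equals A transposed (A in the transposed layout of Eq. (2.3)).
print(np.allclose(grad, A.T, atol=1e-5))   # True
```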

Case 3: A function with multiple critical points

    Figure 2.4: A function with multiple solutions to the equation "derivative equals zero".

    When the equation df/dx = 0 has multiple solutions, there is in general no exact method for finding the global minimum or maximum. For one-variable functions, the function could be approximately reconstructed by sampling many points. However, for multivariable functions in high dimensions, sampling is expensive. Instead, the gradient descent/ascent approach is used to find the optimal point in the case of convex/concave functions, or a locally optimal point in the case of functions with multiple critical points. The idea is that the gradient vector points in the direction of steepest ascent, so by moving in the direction opposite to the gradient we descend toward a local minimum:

        X_{i+1} = X_i - \alpha \, \nabla f(X_i)    (2.5)
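Below is a minimal gradient-descent sketch of the update rule in Eq. (2.5), applied to the one-variable convex function f(x) = x^2 from Figure 2.3. The starting point and step size alpha are arbitrary illustrative choices:

```python
# Gradient descent (Eq. 2.5), x_{i+1} = x_i - alpha * f'(x_i), on f(x) = x^2.
# Since f is convex with its single minimum at x = 0, the iterates converge there.
def grad_f(x):
    return 2.0 * x             # derivative of f(x) = x^2

x = 5.0                        # arbitrary starting point
alpha = 0.1                    # step size (learning rate), an illustrative choice
for _ in range(100):
    x = x - alpha * grad_f(x)  # move opposite to the gradient

print(x)                       # approximately 0, the minimizer of f
```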