CSC411: Final Review
James Lucas & David Madras
December 3, 2018

Agenda
1. A brief overview
2. Some sample questions

Basic ML Terminology
The final exam will be on the entire course; however, it will be more heavily weighted towards post-midterm material. For pre-midterm material, refer to the midterm review slides on the course website.
Feed-forward Neural Network (NN)
Activation Function
Backpropagation
Fully-connected vs. convolutional NN
Dimensionality Reduction
Principal Component Analysis (PCA)
Autoencoder
Generative vs. Discriminative Classifiers
Naive Bayes
Bayesian parameter estimation
Prior/posterior distributions
Gaussian Discriminant Analysis (GDA)

Basic ML Terminology (continued)
K-Means (hard and soft)
Latent variable/factor models
Clustering
Gaussian Mixture Model (GMM)
Expectation-Maximization (EM) algorithm
Jensen's Inequality
Matrix factorization
Matrix completion
Gaussian Processes
Kernel trick
Reinforcement learning
States/actions/rewards
Exploration/exploitation

Some Questions
Question 1. True or False:
1. PCA always uses an invertible linear map
2. K-Means will always find the global minimum
3. Naive Bayes assumes that all features are independent

Answers: all three are False. PCA projects onto a lower-dimensional subspace, so in general the map cannot be inverted; K-Means is only guaranteed to converge to a local minimum of its objective; Naive Bayes assumes the features are conditionally independent given the class label, not unconditionally independent.

Question 2
1. How can a generative model p(x | y) be used as a classifier?
2. Give one advantage of Bayesian linear regression over maximum-likelihood linear regression. Give a disadvantage.

Subject Areas
1. Neural Networks
2. PCA
3. Probabilistic Models
4. Latent Variable Models
5. Bayesian Learning
6. Reinforcement Learning

Neural Networks
1. Weights and neurons
2. Activation functions
3. Depth and expressive power
4. Backpropagation
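
As a concrete reminder of how these pieces fit together, here is a minimal NumPy sketch (not course-provided code) of a 2-layer network with a tanh hidden layer, trained by manually backpropagating a squared-error loss; the data, layer sizes, and learning rate are made up for illustration.

    import numpy as np

    # Minimal sketch: 2-layer network, tanh hidden layer, squared-error loss,
    # trained by manual backpropagation.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 10))          # 32 examples, 10 features
    y = rng.normal(size=(32, 1))           # regression targets

    W1 = rng.normal(scale=0.1, size=(10, 100))
    W2 = rng.normal(scale=0.1, size=(100, 1))
    lr = 0.01

    for step in range(100):
        h = np.tanh(X @ W1)                    # hidden activations
        y_hat = h @ W2                         # network output
        err = y_hat - y                        # dL/dy_hat for L = 0.5 * ||y_hat - y||^2
        grad_W2 = h.T @ err                    # backprop through the output layer
        grad_h = err @ W2.T                    # gradient w.r.t. hidden activations
        grad_W1 = X.T @ (grad_h * (1 - h**2))  # tanh'(a) = 1 - tanh(a)^2
        W1 -= lr * grad_W1 / len(X)
        W2 -= lr * grad_W2 / len(X)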

Convolutional Neural Networks
1. Convolutional neural network (CNN) architecture
2. Local connections/convolutions/pooling
3. Feature learning
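
For intuition about local connectivity and weight sharing, here is a hedged sketch of the core conv-layer computation: a single-channel 2D cross-correlation (what deep-learning libraries call convolution) followed by 2x2 max pooling. The lack of padding, stride options, and multiple channels is a simplifying assumption.

    import numpy as np

    # Single-channel 2D cross-correlation: the same small kernel is applied
    # at every spatial location (local connections + shared weights).
    def conv2d(img, kernel):
        H, W = img.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
        return out

    # 2x2 max pooling: downsample by taking the max over non-overlapping blocks.
    def max_pool2x2(x):
        H, W = x.shape
        return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))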

Principal Component Analysis (PCA)
1. Dimensionality reduction
2. Linear subspaces
3. Spectral decomposition
4. Autoencoders
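
A minimal sketch of PCA via the spectral decomposition of the empirical covariance; the function name and data are illustrative, and it assumes you want the top-k directions. The reconstruction step also shows why the map is not invertible when k < d.

    import numpy as np

    # Project centered data onto the top-k eigenvectors of the covariance,
    # then reconstruct in the original space.
    def pca(X, k):
        mu = X.mean(axis=0)
        Xc = X - mu
        cov = Xc.T @ Xc / len(X)
        _, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
        U = eigvecs[:, ::-1][:, :k]            # top-k principal directions
        Z = Xc @ U                             # k-dimensional code
        X_hat = Z @ U.T + mu                   # reconstruction (lossy for k < d)
        return Z, X_hat

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    Z, X_hat = pca(X, k=2)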

Probabilistic Models
1. Maximum Likelihood Estimation (MLE)
2. Generative vs. discriminative classification

Probabilistic Models (continued)
1. Bayesian parameter estimation
2. Choosing priors
3. Maximum A Posteriori (MAP) Estimation
4. Gaussian Discriminant Analysis
5. Decision Boundary
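
As an illustration of generative classification, here is a hedged sketch of Gaussian discriminant analysis with a shared covariance matrix (which yields a linear decision boundary); it assumes binary labels in {0, 1} and real-valued features, and the function names are made up.

    import numpy as np

    # Fit class prior, class means, and a shared covariance by maximum likelihood.
    def fit_gda(X, y):
        pi = y.mean()                                   # class prior p(y = 1)
        mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
        Xc = np.where(y[:, None] == 1, X - mu1, X - mu0)
        Sigma = Xc.T @ Xc / len(X)                      # shared covariance (MLE)
        return pi, mu0, mu1, Sigma

    # Classify by comparing log p(x | y) + log p(y) for the two classes.
    def predict_gda(X, pi, mu0, mu1, Sigma):
        inv = np.linalg.inv(Sigma)
        def log_joint(mu, prior):                       # up to a shared constant
            d = X - mu
            return -0.5 * np.sum(d @ inv * d, axis=1) + np.log(prior)
        return (log_joint(mu1, pi) > log_joint(mu0, 1 - pi)).astype(int)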

K-Means
1. Latent-variable models for clustering
2. Initialization, assignment, refitting
3. Convergence
4. Soft vs. hard K-means
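
A minimal sketch of hard K-means with random initialization, alternating the assignment and refitting steps until the assignments stop changing; it assumes float-valued data, and empty clusters are simply left unchanged.

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
        assign = np.zeros(len(X), dtype=int)
        for _ in range(n_iters):
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            new_assign = dists.argmin(axis=1)            # assignment step
            if np.array_equal(new_assign, assign):
                break                                    # converged (a local minimum)
            assign = new_assign
            for j in range(k):                           # refitting step
                if np.any(assign == j):
                    centers[j] = X[assign == j].mean(axis=0)
        return centers, assign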

Expectation-Maximization (EM)
1. Gaussian Mixture Model (GMM)
2. E-Step, M-Step
3. GMM vs. K-Means
4. Autoencoders
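
A hedged sketch of EM for a Gaussian mixture with full covariances: the E-step computes responsibilities and the M-step refits the mixing weights, means, and covariances. The initialization, iteration count, and the small ridge added to each covariance are illustrative choices.

    import numpy as np

    # Log density of a multivariate Gaussian evaluated at each row of X.
    def log_gauss(X, mu, Sigma):
        d = X.shape[1]
        diff = X - mu
        inv = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * (np.sum(diff @ inv * diff, axis=1) + logdet + d * np.log(2 * np.pi))

    def em_gmm(X, k, n_iters=50, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(k, 1.0 / k)
        mu = X[rng.choice(n, size=k, replace=False)]
        Sigma = np.stack([np.eye(d)] * k)
        for _ in range(n_iters):
            # E-step: responsibilities r[n, j] = p(z = j | x_n)
            log_r = np.stack([np.log(pi[j]) + log_gauss(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
            log_r -= log_r.max(axis=1, keepdims=True)    # for numerical stability
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: responsibility-weighted maximum-likelihood updates
            Nk = r.sum(axis=0)
            pi = Nk / n
            mu = (r.T @ X) / Nk[:, None]
            for j in range(k):
                diff = X - mu[j]
                Sigma[j] = (r[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
        return pi, mu, Sigma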

Matrix Factorization
1. Rank-k approximation
2. Matrix completion (movie recommendations)
3. Latent factor models
4. Alternating Least Squares (EM)
5. K-Means, Sparse Coding
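
A minimal alternating-least-squares sketch for rank-k matrix completion, in the spirit of the movie-recommendation example: with one factor fixed, each row of the other factor is a small ridge-regularized least-squares problem over the observed entries. The mask convention and regularization strength are assumptions.

    import numpy as np

    # R: ratings matrix, M: binary mask of observed entries (same shape).
    def als(R, M, k=2, lam=0.1, n_iters=20, seed=0):
        rng = np.random.default_rng(seed)
        n, m = R.shape
        U = rng.normal(scale=0.1, size=(n, k))
        V = rng.normal(scale=0.1, size=(m, k))
        for _ in range(n_iters):
            for i in range(n):                       # update user factors
                obs = M[i] > 0
                Vo = V[obs]
                U[i] = np.linalg.solve(Vo.T @ Vo + lam * np.eye(k), Vo.T @ R[i, obs])
            for j in range(m):                       # update item factors
                obs = M[:, j] > 0
                Uo = U[obs]
                V[j] = np.linalg.solve(Uo.T @ Uo + lam * np.eye(k), Uo.T @ R[obs, j])
        return U, V                                  # R is approximated by U @ V.T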

Bayesian Linear Regression
1. Posterior distribution over the parameters
2. Bayesian decision theory
3. Bayesian optimization
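
A minimal sketch of the Bayesian linear regression posterior under a zero-mean isotropic Gaussian prior and Gaussian observation noise; alpha and sigma2 are assumed hyperparameters, and the closed-form mean and covariance follow the standard conjugate update.

    import numpy as np

    # Prior w ~ N(0, alpha^{-1} I), likelihood y ~ N(Xw, sigma2 I)
    # => posterior over w is Gaussian with mean m and covariance S.
    def blr_posterior(X, y, alpha=1.0, sigma2=0.1):
        d = X.shape[1]
        A = alpha * np.eye(d) + X.T @ X / sigma2     # posterior precision
        S = np.linalg.inv(A)                         # posterior covariance
        m = S @ X.T @ y / sigma2                     # posterior mean
        return m, S

    # Predictive mean and variance at new inputs (noise included in the variance).
    def predict(X_new, m, S, sigma2=0.1):
        mean = X_new @ m
        var = sigma2 + np.sum(X_new @ S * X_new, axis=1)
        return mean, var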

Gaussian Processes
1. Distribution over functions!
2. Any finite collection of function values is jointly Gaussian
3. Kernel functions
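
A minimal Gaussian-process regression sketch with an RBF kernel: conditioning the joint Gaussian over training and test function values gives the posterior mean and covariance at the test points. The kernel choice, lengthscale, and noise variance are illustrative assumptions.

    import numpy as np

    # RBF (squared-exponential) kernel between the rows of A and B.
    def rbf(A, B, lengthscale=1.0):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * sq / lengthscale**2)

    # Posterior over f(X_star) given noisy observations y at inputs X.
    def gp_posterior(X, y, X_star, sigma2=0.1):
        K = rbf(X, X) + sigma2 * np.eye(len(X))
        K_s = rbf(X, X_star)
        K_ss = rbf(X_star, X_star)
        alpha = np.linalg.solve(K, y)
        mean = K_s.T @ alpha                              # posterior mean
        cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)      # posterior covariance
        return mean, cov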

Reinforcement Learning
1. Choosing actions to maximize long-term reward
2. States, actions, rewards, policies
3. Value function and value iteration
4. Batch vs. Online RL
5. Exploration vs. Exploitation
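
A minimal tabular value-iteration sketch for a toy MDP; the representation (P[a, s, s'] transition probabilities, R[s, a] expected rewards) and the discount factor are assumptions for illustration, not anything specific from the course.

    import numpy as np

    def value_iteration(P, R, gamma=0.9, tol=1e-6):
        n_actions, n_states, _ = P.shape
        V = np.zeros(n_states)
        while True:
            Q = R.T + gamma * P @ V              # Q[a, s] = R[s, a] + gamma * E[V(s')]
            V_new = Q.max(axis=0)                # greedy backup over actions
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=0)   # optimal values and a greedy policy
            V = V_new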

Sample Question 1
Consider a 2-layer neural network, f, with 10, 100, and 100 units in its input, hidden, and output layers respectively. We denote the weights of the network as W^(1) and W^(2).
a) What are the dimensions of W^(1) and W^(2)? How many trainable parameters are in the neural network (ignoring biases)?
We will now replace the weights of f with a simple Hypernetwork. The Hypernetwork, h, will be a two-layer network with 10 input units, 10 hidden units, and K output units, where K is equal to the total number of trainable parameters in f. In each forward pass, the output of h will be reshaped and used as the weights of f.
b) How many parameters does h have (ignoring biases)?
c) How might we change the output layer to reduce the number of parameters? State how many trainable parameters h has with your suggested method. (HINT: use matrix factorization)

Q1 Solution
a) W^(1) ∈ R^(10×100), W^(2) ∈ R^(100×100). Total parameters: 10×100 + 100×100 = 11000.
b) Total parameters: 10×10 + 10×11000 = 110100.
c) Output low-rank approximations to each weight matrix. Instead of outputting W^(l), output U^(l) and V^(l) such that W^(l) ≈ U^(l) V^(l). For example:
U^(1) ∈ R^(10×2), V^(1) ∈ R^(2×100), U^(2) ∈ R^(100×2), V^(2) ∈ R^(2×100).
Now the total number of parameters is 10×10 + 10×(2×10 + 2×100 + 100×2 + 2×100) = 6300.
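
A quick way to double-check the arithmetic above:

    # Parameter counts from the solution (ignoring biases).
    f_params = 10 * 100 + 100 * 100                                    # = 11000
    h_params = 10 * 10 + 10 * f_params                                 # = 110100
    low_rank = 10 * 10 + 10 * (2 * 10 + 2 * 100 + 100 * 2 + 2 * 100)   # = 6300
    print(f_params, h_params, low_rank)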

Quick interlude: Hypernetworks
This isn't quite how Hypernetworks typically work... See Ha et al., 2016 for details.

Sample Question 2
a) State what conditions a function k : R^n × R^n → R must satisfy to be a valid kernel function.
b) Prove that a symmetric matrix K ∈ R^(d×d) is positive semidefinite if and only if for all vectors c ∈ R^d we have c^T K c ≥ 0.

Q2 Solution
a) k must be symmetric, and its Gram matrix, given by K_ij = k(x_i, x_j), must be positive semidefinite for any choice of points x_1, ..., x_d.
b) Forward direction: if K is PSD, then there exists an orthonormal basis of eigenvectors v_1, ..., v_d with non-negative eigenvalues λ_1, ..., λ_d. We can write any vector c in this basis: c = Σ_{i=1}^d a_i v_i. Then
c^T K c = (Σ_i a_i v_i)^T K (Σ_j a_j v_j) = Σ_{i,j} a_i a_j v_i^T K v_j = Σ_{i,j} a_i a_j λ_j v_i^T v_j.
Since the v's are orthonormal, v_i^T v_j is 1 when i = j and 0 otherwise, so this sum equals Σ_{i=1}^d a_i^2 λ_i ≥ 0.
Reverse direction: pick c = v_i, an eigenvector of K. Then v_i^T K v_i = λ_i v_i^T v_i = λ_i ≥ 0, so every eigenvalue is non-negative and K is PSD.
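
A quick numerical sanity check of part (b), using the Gram matrix of an RBF kernel on random points (an illustrative choice):

    import numpy as np

    # Build an RBF Gram matrix and confirm its eigenvalues and random quadratic
    # forms c^T K c are non-negative (up to floating-point error).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-0.5 * sq)                        # Gram matrix of the RBF kernel
    print(np.linalg.eigvalsh(K).min() >= -1e-10)
    c = rng.normal(size=20)
    print(c @ K @ c >= -1e-10)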